Constants and library imports¶
%load_ext autoreload
%autoreload 2
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.io as pio
import plotly.express as px
import hdbscan
import shap
from bokeh.plotting import curdoc
import lightgbm as lgb
import catboost as cb
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
from functools import partial
from utils.distribution import get_df_info, DistributionPlotter
from utils.reductors import make_tsne, make_umap
from utils.drplotter import DimReductionPlotter
from utils.lgbm import plot_feature_info, plot_scores, plot_tree_info
sns.set_style("dark")
plt.style.use("dark_background")
pio.templates.default = "plotly_dark"
curdoc().theme = "dark_minimal"
shap.initjs()
WEBSTAT_DATASET_PATH = "./datasets/t1_webstat.csv"
TRAIN_DATASET_PATH = "./datasets/train.csv"
TEST_DATASET_PATH = "./datasets/test.csv"
SUBMISSION_PATH = "./datasets/submission.csv"
SAMPLE_SUBMISSION_PATH = "./datasets/sample_submission.csv"
Dataset analysis¶
Webstat¶
Initial inspection¶
Sort right away by sessionkey_id and date_time
web = pd.read_csv(WEBSTAT_DATASET_PATH)
web["date_time"] = pd.to_datetime(web["date_time"])
web = web.sort_values(["sessionkey_id", "date_time"])
web.head()
| sessionkey_id | date_time | page_type | pageview_number | pageview_duration_sec | category_id | model_id | good_id | price | product_in_sale | |
|---|---|---|---|---|---|---|---|---|---|---|
| 2268917 | 109996122 | 1975-10-17 13:42:56.953 | 2 | 1 | 11.0 | 722.0 | NaN | NaN | NaN | NaN |
| 2268918 | 109996122 | 1975-10-17 13:43:07.510 | 2 | 2 | 22.0 | 7196.0 | NaN | NaN | NaN | NaN |
| 2268919 | 109996122 | 1975-10-17 13:43:29.860 | 2 | 3 | 25.0 | 779.0 | NaN | NaN | NaN | NaN |
| 2269206 | 109996122 | 1975-10-17 13:43:54.757 | 2 | 4 | 9.0 | 7196.0 | NaN | NaN | NaN | NaN |
| 2267445 | 109996122 | 1975-10-17 13:44:03.803 | 2 | 5 | 11.0 | 723.0 | NaN | NaN | NaN | NaN |
get_df_info(web)
| dtype | nunique | nan | zero | empty string | example(-s) | mode, mode proportion | trash_score | |
|---|---|---|---|---|---|---|---|---|
| product_in_sale | float64 | 2 | n: 0.633 | NaN | NaN | (1.0, nan) | (1.0, 1.0) | 1.000 |
| good_id | float64 | 233144 | n: 0.633 | NaN | NaN | (57794032.0, 66632395.0) | (66921494.0, 0.001) | 0.633 |
| price | float64 | 12299 | n: 0.633 | NaN | NaN | (59.0, 12481.0) | (952.0, 0.004) | 0.633 |
| model_id | float64 | 181760 | n: 0.613 | NaN | NaN | (19237096.0, 1734006.0) | (18340251.0, 0.002) | 0.613 |
| category_id | float64 | 3549 | n: 0.294 | NaN | NaN | (4012.0, 4553.0) | (155.0, 0.054) | 0.294 |
| pageview_duration_sec | float64 | 2975 | n: 0.088 | z: 0.006 | NaN | (-13608.0, -6658.0) | (9.0, 0.025) | 0.094 |
| sessionkey_id | int64 | 328430 | NaN | NaN | NaN | (113210921, 117494105) | (119635649.0, 0.0) | NaN |
| date_time | datetime64[ns] | 3329535 | NaN | NaN | NaN | (1975-12-24 18:26:33.407000, 1975-12-17 23:03:... | (1976-01-25 22:35:55.557000, 0.0) | NaN |
| page_type | int64 | 13 | NaN | NaN | NaN | (11, 8) | (1.0, 0.387) | NaN |
| pageview_number | int64 | 632 | NaN | NaN | NaN | (329, 248) | (1.0, 0.097) | NaN |
A pile of NaNs, not great :/
Let's look at the NaNs in the last columns¶
for column in ("category_id", "model_id", "good_id", "price", "product_in_sale"):
    print(column)
    print(web[~web[column].isna()]["page_type"].value_counts())
    print()
category_id
page_type
1    1289570
2     932848
4     133137
Name: count, dtype: int64

model_id
page_type
1    1289578
Name: count, dtype: int64

good_id
page_type
1    1225243
Name: count, dtype: int64

price
page_type
1    1225243
Name: count, dtype: int64

product_in_sale
page_type
1    1225243
Name: count, dtype: int64
Nice, we can conclude that page_type = 1 is a product page: good_id, price, and product_in_sale are filled in only for it!
web[(web["page_type"] == 1)]
| sessionkey_id | date_time | page_type | pageview_number | pageview_duration_sec | category_id | model_id | good_id | price | product_in_sale | |
|---|---|---|---|---|---|---|---|---|---|---|
| 2268628 | 110019268 | 1975-10-17 15:27:58.257 | 1 | 2 | 43.0 | 206.0 | 8748965.0 | 22312252.0 | 2986.0 | 1.0 |
| 2268629 | 110020180 | 1975-10-17 15:29:52.147 | 1 | 1 | NaN | 147.0 | 1513237.0 | 55614318.0 | 4490.0 | 1.0 |
| 2269208 | 110040418 | 1975-10-17 17:05:41.530 | 1 | 1 | 25.0 | 1200.0 | 1827718.0 | 10547740.0 | 726.0 | 1.0 |
| 2268920 | 110040418 | 1975-10-17 17:06:06.163 | 1 | 2 | 43.0 | 1200.0 | 1827718.0 | 10547740.0 | 726.0 | 1.0 |
| 2267447 | 110040418 | 1975-10-17 17:07:45.243 | 1 | 6 | 55.0 | 1200.0 | 14122715.0 | 28114543.0 | 430.0 | 1.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2267442 | 134628743 | 1976-02-16 20:56:31.220 | 1 | 2 | 4.0 | 127.0 | 19246197.0 | NaN | NaN | NaN |
| 2074763 | 134628743 | 1976-02-16 20:56:54.337 | 1 | 4 | 66.0 | 127.0 | 17200183.0 | NaN | NaN | NaN |
| 2267443 | 134628743 | 1976-02-16 20:58:12.070 | 1 | 6 | 9.0 | 127.0 | 9401923.0 | NaN | NaN | NaN |
| 2267444 | 134628743 | 1976-02-16 20:58:26.223 | 1 | 8 | 3.0 | 127.0 | 17200183.0 | NaN | NaN | NaN |
| 2011581 | 134629277 | 1976-02-16 20:58:06.137 | 1 | 1 | 16.0 | NaN | NaN | NaN | NaN | NaN |
1291547 rows × 10 columns
But even here there are still NaNs :(
Let's plot the distributions¶
web_plotter = DistributionPlotter(web)
web_plotter.plot_all()
web_plotter.show_plot()
Strange data 0_0¶
The pageview_duration_sec field is odd: it contains NaNs and negative numbers...
Let's figure out how that happened!
web[(web["pageview_duration_sec"] < 0)]
| sessionkey_id | date_time | page_type | pageview_number | pageview_duration_sec | category_id | model_id | good_id | price | product_in_sale | |
|---|---|---|---|---|---|---|---|---|---|---|
| 2270348 | 110328896 | 1975-10-19 15:51:10.010 | 1 | 16 | -1.0 | 2873.0 | 144660.0 | 65175298.0 | 1178.0 | 1.0 |
| 2272147 | 110422717 | 1975-10-20 00:37:21.590 | 1 | 12 | -6.0 | 1241.0 | 16890898.0 | 62773803.0 | 732.0 | 1.0 |
| 2272150 | 110422717 | 1975-10-20 01:08:05.410 | 4 | 33 | -15.0 | 1229.0 | NaN | NaN | NaN | NaN |
| 2273210 | 110422717 | 1975-10-20 01:08:06.030 | 4 | 32 | -1.0 | 5673.0 | NaN | NaN | NaN | NaN |
| 2272429 | 110467977 | 1975-10-20 10:58:39.170 | 1 | 1 | -1.0 | 1330.0 | 3563114.0 | 20279782.0 | 1099.0 | 1.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2051829 | 134596933 | 1976-02-16 18:34:59.390 | 3 | 4 | -7.0 | NaN | NaN | NaN | NaN | NaN |
| 2056596 | 134609922 | 1976-02-16 20:08:51.777 | 2 | 23 | -9.0 | 7790.0 | NaN | NaN | NaN | NaN |
| 2056601 | 134609922 | 1976-02-16 20:20:10.567 | 2 | 46 | -6.0 | 7323.0 | NaN | NaN | NaN | NaN |
| 2266220 | 134616502 | 1976-02-16 20:03:29.110 | 2 | 6 | -1.0 | 201.0 | NaN | NaN | NaN | NaN |
| 2059338 | 134621944 | 1976-02-16 20:29:41.960 | 1 | 5 | -3.0 | 127.0 | 1799088.0 | 37969387.0 | 5331.0 | 1.0 |
2739 rows × 10 columns
web[(web["sessionkey_id"] == 110328896)]
| sessionkey_id | date_time | page_type | pageview_number | pageview_duration_sec | category_id | model_id | good_id | price | product_in_sale | |
|---|---|---|---|---|---|---|---|---|---|---|
| 2271131 | 110328896 | 1975-10-19 15:09:36.890 | 2 | 1 | 1.0 | 2873.0 | NaN | NaN | NaN | NaN |
| 2270344 | 110328896 | 1975-10-19 15:09:37.010 | 2 | 2 | 204.0 | 1241.0 | NaN | NaN | NaN | NaN |
| 2270345 | 110328896 | 1975-10-19 15:13:01.207 | 1 | 3 | 361.0 | 1241.0 | 27621026.0 | 65231588.0 | 360.0 | 1.0 |
| 2270346 | 110328896 | 1975-10-19 15:19:02.770 | 2 | 4 | 122.0 | 2446.0 | NaN | NaN | NaN | NaN |
| 2269942 | 110328896 | 1975-10-19 15:21:04.087 | 2 | 5 | 292.0 | 1333.0 | NaN | NaN | NaN | NaN |
| 2269943 | 110328896 | 1975-10-19 15:25:56.770 | 1 | 6 | 434.0 | 1333.0 | 9208426.0 | 64816713.0 | 405.0 | 1.0 |
| 2270347 | 110328896 | 1975-10-19 15:33:10.330 | 3 | 7 | 49.0 | NaN | NaN | NaN | NaN | NaN |
| 2270731 | 110328896 | 1975-10-19 15:33:59.310 | 2 | 8 | NaN | 1183.0 | NaN | NaN | NaN | NaN |
| 2270732 | 110328896 | 1975-10-19 15:37:10.950 | 1 | 11 | 365.0 | 1241.0 | 29232485.0 | 63119872.0 | 252.0 | 1.0 |
| 2270733 | 110328896 | 1975-10-19 15:43:15.350 | 1 | 12 | 83.0 | 1241.0 | 27704320.0 | 60621585.0 | 334.0 | 1.0 |
| 2270734 | 110328896 | 1975-10-19 15:44:38.830 | 3 | 13 | 73.0 | NaN | NaN | NaN | NaN | NaN |
| 2269944 | 110328896 | 1975-10-19 15:45:51.810 | 5 | 14 | 94.0 | NaN | NaN | NaN | NaN | NaN |
| 2270735 | 110328896 | 1975-10-19 15:47:25.613 | 1 | 15 | 225.0 | 2873.0 | 209585.0 | 60429426.0 | 1178.0 | 1.0 |
| 2271132 | 110328896 | 1975-10-19 15:51:09.210 | 5 | 17 | 114.0 | NaN | NaN | NaN | NaN | NaN |
| 2270348 | 110328896 | 1975-10-19 15:51:10.010 | 1 | 16 | -1.0 | 2873.0 | 144660.0 | 65175298.0 | 1178.0 | 1.0 |
| 2269945 | 110328896 | 1975-10-19 15:53:03.030 | 1 | 18 | 0.0 | 2873.0 | 6369236.0 | 63392163.0 | 1271.0 | 1.0 |
| 2271133 | 110328896 | 1975-10-19 15:53:03.570 | 5 | 19 | 34.0 | NaN | NaN | NaN | NaN | NaN |
| 2270736 | 110328896 | 1975-10-19 15:53:37.990 | 5 | 20 | 53.0 | NaN | NaN | NaN | NaN | NaN |
| 2271134 | 110328896 | 1975-10-19 15:54:30.190 | 2 | 21 | 18.0 | 2873.0 | NaN | NaN | NaN | NaN |
| 2270737 | 110328896 | 1975-10-19 15:54:48.030 | 2 | 22 | NaN | 2873.0 | NaN | NaN | NaN | NaN |
As you can see, the order within the session simply got shuffled: pageview_number 16 comes after 17 in date_time order, so its duration went negative.
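A quick way to find the sessions affected by this shuffling is to check, per session, whether pageview_number is monotonic once the rows are sorted by date_time. A minimal sketch on made-up toy data (only the column names match the notebook):

```python
import pandas as pd

# Toy data mimicking the session above: rows already sorted by date_time,
# but pageview_number 16 shows up after 17.
web = pd.DataFrame({
    "sessionkey_id": [110328896] * 4,
    "pageview_number": [14, 15, 17, 16],
})

# Flag sessions whose pageview_number is not monotonically
# increasing in date_time order.
shuffled = (
    web.groupby("sessionkey_id")["pageview_number"]
    .apply(lambda s: not s.is_monotonic_increasing)
)
print(shuffled.loc[110328896])  # True
```

On the real frame the same groupby would give a boolean mask over all sessions, which could then be used to count or inspect the broken ones.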
web[(web["sessionkey_id"] == 133729636)]
| sessionkey_id | date_time | page_type | pageview_number | pageview_duration_sec | category_id | model_id | good_id | price | product_in_sale | |
|---|---|---|---|---|---|---|---|---|---|---|
| 2250161 | 133729636 | 1976-02-10 13:54:36.863 | 1 | 1 | 249.0 | 1200.0 | 136805.0 | 28904311.0 | 2264.0 | 1.0 |
| 2044139 | 133729636 | 1976-02-10 13:58:45.423 | 1 | 2 | 3229.0 | 1200.0 | 136805.0 | 28904311.0 | 2264.0 | 1.0 |
| 2033903 | 133729636 | 1976-02-10 13:59:12.187 | 1 | 3 | 579.0 | 1200.0 | 19566244.0 | 62771283.0 | 1892.0 | 1.0 |
| 2033904 | 133729636 | 1976-02-10 14:08:51.613 | 3 | 4 | 97.0 | NaN | NaN | NaN | NaN | NaN |
| 2033905 | 133729636 | 1976-02-10 14:18:04.980 | 1 | 8 | 1.0 | 5605.0 | 132912.0 | 19870269.0 | 1790.0 | 1.0 |
| 1909261 | 133729636 | 1976-02-10 14:18:05.097 | 1 | 9 | 29.0 | 1200.0 | 136805.0 | 28904311.0 | 2264.0 | 1.0 |
| 1909262 | 133729636 | 1976-02-10 14:18:34.330 | 3 | 10 | 36.0 | NaN | NaN | NaN | NaN | NaN |
| 2033906 | 133729636 | 1976-02-10 14:19:10.247 | 1 | 11 | 17.0 | 1200.0 | 136805.0 | 28904311.0 | 2264.0 | 1.0 |
| 1909263 | 133729636 | 1976-02-10 14:19:27.727 | 3 | 12 | 8.0 | NaN | NaN | NaN | NaN | NaN |
| 2250162 | 133729636 | 1976-02-10 14:45:35.320 | 1 | 1 | -2810.0 | 1200.0 | 136805.0 | 28904311.0 | 2264.0 | 1.0 |
| 2250163 | 133729636 | 1976-02-10 14:51:14.417 | 1 | 2 | 80.0 | 1200.0 | 136805.0 | 28904311.0 | 2264.0 | 1.0 |
| 2250164 | 133729636 | 1976-02-10 14:52:34.117 | 1 | 3 | -2623.0 | 1200.0 | 136805.0 | 28904311.0 | 2264.0 | 1.0 |
| 2044140 | 133729636 | 1976-02-10 14:55:36.570 | 3 | 4 | -2708.0 | NaN | NaN | NaN | NaN | NaN |
| 2250165 | 133729636 | 1976-02-10 15:07:49.180 | 1 | 1 | -4144.0 | 1200.0 | 136805.0 | 28904311.0 | 2264.0 | 1.0 |
Here it's a different problem: three sessions are merged into one (pageview_number restarts from 1 twice), and pageview_duration_sec ends up computed across each other's date_time values, producing the negative durations.
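One way to repair such merged sessions would be to assign a sub-session id that increments whenever pageview_number drops back down. A sketch on toy data, not an actual cleaning step from this notebook (`subsession_id` is a made-up name):

```python
import pandas as pd

# Toy merged session, like 133729636: pageview_number restarts from 1 twice.
web = pd.DataFrame({
    "sessionkey_id": [133729636] * 7,
    "pageview_number": [1, 2, 3, 4, 1, 2, 1],
})

# Increment a counter every time pageview_number decreases within a session.
restarts = web.groupby("sessionkey_id")["pageview_number"].diff() < 0
web["subsession_id"] = restarts.astype(int).groupby(web["sessionkey_id"]).cumsum()
print(web["subsession_id"].tolist())  # [0, 0, 0, 0, 1, 1, 2]
```

Grouping by (sessionkey_id, subsession_id) would then keep each visit's durations self-consistent.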
Train¶
Initial inspection¶
tr = pd.read_csv(TRAIN_DATASET_PATH, index_col="order_id")
tr["create_time"] = pd.to_datetime(tr["create_time"])
tr["model_create_time"] = pd.to_datetime(tr["model_create_time"])
tr.head()
| create_time | good_id | price | utm_medium | utm_source | sessionkey_id | category_id | parent_id | root_id | model_id | is_moderated | rating_value | rating_count | description_length | goods_qty | pics_qty | model_create_time | is_callcenter | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| order_id | ||||||||||||||||||
| 1269921 | 1975-12-26 09:30:08 | 9896348 | 753 | 5 | 8.0 | 123777004 | 139 | 133 | 124 | 123517 | 1 | 5.0 | 6.0 | 1204 | 6 | 2 | 1971-04-14 00:15:20 | 1 |
| 1270034 | 1975-12-26 10:28:57 | 9896348 | 753 | 1 | 2.0 | 123781654 | 139 | 133 | 124 | 123517 | 1 | 5.0 | 6.0 | 1204 | 6 | 2 | 1971-04-14 00:15:20 | 0 |
| 1268272 | 1975-12-25 11:24:28 | 9896348 | 753 | 2 | 3.0 | 123591002 | 139 | 133 | 124 | 123517 | 1 | 5.0 | 6.0 | 1204 | 6 | 2 | 1971-04-14 00:15:20 | 1 |
| 1270544 | 1975-12-26 14:16:06 | 9896348 | 753 | 1 | 1.0 | 123832302 | 139 | 133 | 124 | 123517 | 1 | 5.0 | 6.0 | 1204 | 6 | 2 | 1971-04-14 00:15:20 | 1 |
| 1270970 | 1975-12-26 18:21:47 | 9896348 | 753 | 3 | 56.0 | 123881603 | 139 | 133 | 124 | 123517 | 1 | 5.0 | 6.0 | 1204 | 6 | 2 | 1971-04-14 00:15:20 | 0 |
get_df_info(tr)
| dtype | nunique | nan | zero | empty string | example(-s) | mode, mode proportion | trash_score | |
|---|---|---|---|---|---|---|---|---|
| is_moderated | int64 | 2 | NaN | z: 0.049 | NaN | (0, 1) | (1.0, 0.951) | 0.951 |
| rating_value | float64 | 11 | n: 0.677 | NaN | NaN | (6.0, 10.0) | (5.0, 0.672) | 0.677 |
| is_callcenter | int64 | 2 | NaN | z: 0.645 | NaN | (0, 1) | (0.0, 0.645) | 0.645 |
| rating_count | float64 | 30 | n: 0.507 | z: 0.136 | NaN | (35.0, 13.0) | (1.0, 0.285) | 0.642 |
| description_length | int64 | 3106 | NaN | z: 0.384 | NaN | (802, 43) | (0.0, 0.384) | 0.384 |
| utm_source | float64 | 289 | n: 0.1 | NaN | NaN | (6.0, 227.0) | (1.0, 0.476) | 0.100 |
| model_create_time | datetime64[ns] | 31697 | n: 0.01 | NaN | NaN | (1974-12-22 19:30:29, 1974-05-15 21:33:11) | (1975-02-10 17:16:18, 0.005) | 0.010 |
| pics_qty | int64 | 34 | NaN | z: 0.005 | NaN | (19, 16) | (1.0, 0.366) | 0.005 |
| create_time | datetime64[ns] | 102998 | NaN | NaN | NaN | (1976-01-03 11:35:20, 1976-01-01 13:30:59) | (1976-01-20 10:49:10, 0.0) | NaN |
| good_id | int64 | 53691 | NaN | NaN | NaN | (59724690, 32240662) | (66921494.0, 0.002) | NaN |
| price | int64 | 6362 | NaN | NaN | NaN | (2444, 152) | (264.0, 0.009) | NaN |
| utm_medium | int64 | 8 | NaN | NaN | NaN | (1, 6) | (1.0, 0.457) | NaN |
| sessionkey_id | int64 | 96803 | NaN | NaN | NaN | (121264750, 112943865) | (125996889.0, 0.0) | NaN |
| category_id | int64 | 1733 | NaN | NaN | NaN | (7178, 3554) | (155.0, 0.09) | NaN |
| parent_id | int64 | 368 | NaN | NaN | NaN | (1542, 7948) | (154.0, 0.094) | NaN |
| root_id | int64 | 26 | NaN | NaN | NaN | (1481, 2303) | (1183.0, 0.264) | NaN |
| model_id | int64 | 37299 | NaN | NaN | NaN | (18677008, 4142405) | (18340251.0, 0.005) | NaN |
| goods_qty | int64 | 114 | NaN | NaN | NaN | (55, 8) | (1.0, 0.319) | NaN |
Let's plot the distributions¶
tr_plotter = DistributionPlotter(tr, hue_col="is_callcenter")
tr_plotter.plot_all()
tr_plotter.show_plot()
Test¶
Initial inspection¶
tst = pd.read_csv(TEST_DATASET_PATH, index_col="order_id")
tst["create_time"] = pd.to_datetime(tst["create_time"])
tst["model_create_time"] = pd.to_datetime(tst["model_create_time"])
tst.head()
| create_time | good_id | price | utm_medium | utm_source | sessionkey_id | category_id | parent_id | root_id | model_id | is_moderated | rating_value | rating_count | description_length | goods_qty | pics_qty | model_create_time | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| order_id | |||||||||||||||||
| 1350922 | 1976-02-05 15:08:37 | 9896348 | 1143 | 1 | 2.0 | 132744630 | 139 | 133 | 124 | 123517 | 1 | 5.0 | 6.0 | 1204 | 6 | 2 | 1971-04-14 00:15:20 |
| 1354989 | 1976-02-07 15:26:00 | 69445048 | 1707 | 1 | 1.0 | 133161905 | 136 | 133 | 124 | 123551 | 1 | 10.0 | 0.0 | 2010 | 26 | 3 | 1971-04-14 00:15:20 |
| 1352637 | 1976-02-06 11:43:58 | 70607886 | 576 | 1 | 1.0 | 132792626 | 136 | 133 | 124 | 123583 | 1 | 3.0 | 4.0 | 0 | 34 | 7 | 1971-04-14 00:15:20 |
| 1350050 | 1976-02-05 11:26:19 | 61918401 | 436 | 1 | 1.0 | 132683062 | 236 | 232 | 201 | 124228 | 1 | 4.0 | 1.0 | 0 | 2 | 4 | 1971-04-21 00:09:54 |
| 1341733 | 1976-02-01 19:36:32 | 37964900 | 573 | 6 | 4.0 | 131789790 | 138 | 133 | 124 | 123901 | 1 | 5.0 | 1.0 | 0 | 37 | 2 | 1971-04-16 10:52:08 |
get_df_info(tst)
| dtype | nunique | nan | zero | empty string | example(-s) | mode, mode proportion | trash_score | |
|---|---|---|---|---|---|---|---|---|
| is_moderated | int64 | 2 | NaN | z: 0.059 | NaN | (0, 1) | (1.0, 0.941) | 0.941 |
| rating_value | float64 | 11 | n: 0.7 | NaN | NaN | (6.0, 10.0) | (5.0, 0.712) | 0.700 |
| rating_count | float64 | 31 | n: 0.539 | z: 0.13 | NaN | (35.0, 21.0) | (1.0, 0.293) | 0.669 |
| description_length | int64 | 2050 | NaN | z: 0.423 | NaN | (647, 1737) | (0.0, 0.423) | 0.423 |
| utm_source | float64 | 130 | n: 0.09 | NaN | NaN | (53.0, 11.0) | (1.0, 0.439) | 0.090 |
| model_create_time | datetime64[ns] | 9443 | n: 0.011 | NaN | NaN | (1975-05-20 17:22:57, 1975-08-29 02:06:07) | (1975-02-10 17:16:18, 0.006) | 0.011 |
| pics_qty | int64 | 28 | NaN | z: 0.006 | NaN | (30, 47) | (1.0, 0.381) | 0.006 |
| create_time | datetime64[ns] | 16934 | NaN | NaN | NaN | (1976-02-09 18:42:42, 1976-02-05 11:40:07) | (1976-02-13 14:02:08, 0.0) | NaN |
| good_id | int64 | 12183 | NaN | NaN | NaN | (68707597, 31785151) | (59028240.0, 0.003) | NaN |
| price | int64 | 3299 | NaN | NaN | NaN | (1312, 3965) | (271.0, 0.007) | NaN |
| utm_medium | int64 | 8 | NaN | NaN | NaN | (6, 4) | (1.0, 0.404) | NaN |
| sessionkey_id | int64 | 16019 | NaN | NaN | NaN | (132350004, 133939423) | (132712616.0, 0.0) | NaN |
| category_id | int64 | 1071 | NaN | NaN | NaN | (3591, 4791) | (155.0, 0.106) | NaN |
| parent_id | int64 | 296 | NaN | NaN | NaN | (7373, 7941) | (154.0, 0.11) | NaN |
| root_id | int64 | 24 | NaN | NaN | NaN | (1481, 1478) | (1183.0, 0.237) | NaN |
| model_id | int64 | 10239 | NaN | NaN | NaN | (8289546, 21430000) | (18340251.0, 0.006) | NaN |
| goods_qty | int64 | 105 | NaN | NaN | NaN | (10, 103) | (1.0, 0.273) | NaN |
Nothing unusual, everything looks the same as in train
Let's plot the distributions¶
tst_plotter = DistributionPlotter(tst)
tst_plotter.plot_all()
tst_plotter.show_plot()
Brief conclusions on the test set¶
The is_moderated feature is distributed differently from train, so I think it's better not to use it. The remaining features are distributed roughly the same.
Also note that the sessionkey_id values start after those seen in train, so we can assume we are forecasting data from the future.
tr["create_time"].max() < tst["create_time"].min()
True
This is indeed the case, so for validation we will split orders by time.
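Since test strictly follows train in time, such a split can be sketched on a toy frame like this (the 25% holdout fraction is an arbitrary choice for illustration, not something fixed by the notebook):

```python
import pandas as pd

# Toy orders table; in the notebook this would be tr with its create_time.
orders = pd.DataFrame({
    "create_time": pd.to_datetime(
        ["1975-12-01", "1975-12-10", "1975-12-20", "1975-12-30"]
    ),
    "is_callcenter": [0, 1, 0, 1],
}).sort_values("create_time")

# Hold out the most recent 25% of orders, mirroring how test follows train.
cutoff = orders["create_time"].quantile(0.75)
train_part = orders[orders["create_time"] <= cutoff]
val_part = orders[orders["create_time"] > cutoff]
print(len(train_part), len(val_part))  # 3 1
```

The same effect is achieved below with `train_test_split(..., shuffle=False)` on data that is already sorted by time.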
EDA¶
Working with sessions¶
The first thing I'd like to do is aggregate the data per session.
web.columns  # So we don't forget :)
Index(['sessionkey_id', 'date_time', 'page_type', 'pageview_number',
'pageview_duration_sec', 'category_id', 'model_id', 'good_id', 'price',
'product_in_sale'],
dtype='object')
(web["product_in_sale"].isna() == web["good_id"].isna()).all()
True
product_in_sale is a useless column :/ (its NaN mask exactly matches good_id's)
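The check above could be packaged into a small helper: a column is redundant in this sense when its NaN mask matches a reference column's and it is constant wherever present. A sketch on toy data (`mirrors_nan_mask` is a made-up name):

```python
import pandas as pd

def mirrors_nan_mask(df: pd.DataFrame, col: str, ref_col: str) -> bool:
    """True if `col` is NaN exactly where `ref_col` is NaN and
    carries at most one distinct value everywhere else."""
    same_mask = (df[col].isna() == df[ref_col].isna()).all()
    constant = df[col].dropna().nunique() <= 1
    return bool(same_mask and constant)

# Toy frame shaped like the real columns.
web = pd.DataFrame({
    "good_id": [10.0, None, 20.0, None],
    "product_in_sale": [1.0, None, 1.0, None],
})
print(mirrors_nan_mask(web, "product_in_sale", "good_id"))  # True
```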
agg_params = {
    "session_length": ("sessionkey_id", lambda x: x.shape[0]),
    #
    "session_datetime_start": ("date_time", lambda x: x.iloc[0]),
    "session_datetime_end": ("date_time", lambda x: x.iloc[-1]),
    #
    "last_page_type": ("page_type", lambda x: x.iloc[-1]),
    **{
        f"page_type_{i}": ("page_type", partial(lambda x, i: x[x == i].count(), i=i))
        for i in (3, 6)
    },
    #
    # **{
    #     f"page_type_{i}": ("page_type", partial(lambda x, i: x[x == i].count(), i=i))
    #     for i in range(1, 13 + 1)
    # },  # Types 3 and 6 turned out the most useful, so I kept only them to save compute time
    #
    #
    "last_pageview_number": ("pageview_number", lambda x: x.max()),
    #
    "pageview_duration_sec_last": ("pageview_duration_sec", lambda x: x.iloc[-1]),
    "pageview_duration_sec_sum": ("pageview_duration_sec", lambda x: np.nansum(x)),
    "pageview_duration_sec_min": ("pageview_duration_sec", lambda x: x.min()),
    "pageview_duration_sec_max": ("pageview_duration_sec", lambda x: x.max()),
    #
    "categories": ("category_id", lambda x: set(x[~x.isna()].astype(int))),
    #
    "models": ("model_id", lambda x: set(x[~x.isna()].astype(int))),
    #
    "goods": ("good_id", lambda x: set(x[~x.isna()].astype(int))),
    #
    "price_min": ("price", lambda x: x.min()),
    "price_max": ("price", lambda x: x.max()),
}
web_aggregate = web.groupby("sessionkey_id", sort=False).agg(**agg_params)
web_aggregate["datetime_diff"] = (
    web_aggregate["session_datetime_end"] - web_aggregate["session_datetime_start"]
).dt.total_seconds()
web_aggregate["timedelta_1"] = (
    web_aggregate["datetime_diff"] - web_aggregate["pageview_duration_sec_sum"]
)
for i in (3, 6):
    web_aggregate[f"page_type_{i}_proportion"] = (
        web_aggregate[f"page_type_{i}"] / web_aggregate["session_length"]
    )
# for i in range(1, 13 + 1):
#     web_aggregate[f"page_type_{i}_proportion"] = (
#         web_aggregate[f"page_type_{i}"] / web_aggregate["session_length"]
#     )
web_aggregate.sample(5)
| session_length | session_datetime_start | session_datetime_end | last_page_type | page_type_3 | page_type_6 | last_pageview_number | pageview_duration_sec_last | pageview_duration_sec_sum | pageview_duration_sec_min | pageview_duration_sec_max | categories | models | goods | price_min | price_max | datetime_diff | timedelta_1 | page_type_3_proportion | page_type_6_proportion | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| sessionkey_id | ||||||||||||||||||||
| 131876093 | 3 | 1976-02-01 17:34:26.507 | 1976-02-01 17:36:39.717 | 6 | 1 | 1 | 6 | NaN | 33.0 | 8.0 | 25.0 | {4449} | {27730843} | {60396422} | 256.0 | 256.0 | 133.210 | 100.210 | 0.333333 | 0.333333 |
| 125297077 | 12 | 1976-01-03 18:11:41.850 | 1976-01-03 18:27:26.823 | 3 | 2 | 1 | 15 | NaN | 648.0 | 13.0 | 102.0 | {1701} | {12164171, 23209991, 16750511} | {42056988, 42056958, 42056959} | 353.0 | 494.0 | 944.973 | 296.973 | 0.166667 | 0.083333 |
| 125782858 | 1 | 1976-01-06 11:39:00.443 | 1976-01-06 11:39:00.443 | 1 | 0 | 0 | 1 | NaN | 0.0 | NaN | NaN | {1214} | {3745181} | {29541343} | 2212.0 | 2212.0 | 0.000 | 0.000 | 0.000000 | 0.000000 |
| 113213995 | 6 | 1975-11-02 15:40:19.243 | 1975-11-02 15:53:13.903 | 3 | 2 | 0 | 10 | NaN | 502.0 | 11.0 | 410.0 | {257, 2873} | {209585, 17044329} | {60391505, 59847471} | 1115.0 | 1271.0 | 774.660 | 272.660 | 0.333333 | 0.000000 |
| 125681519 | 3 | 1976-01-05 19:44:44.897 | 1976-01-05 19:46:24.053 | 1 | 0 | 0 | 3 | NaN | 100.0 | 32.0 | 68.0 | {1200} | {1531418} | {30456057} | 2204.0 | 2204.0 | 99.156 | -0.844 | 0.000000 | 0.000000 |
Building the training set¶
def transform(data: pd.DataFrame, web_aggregate: pd.DataFrame):
    data_transformed = data.join(web_aggregate, "sessionkey_id")
    data_transformed["timedelta_2"] = (
        data_transformed["create_time"] - data_transformed["session_datetime_start"]
    ).dt.total_seconds()
    data_transformed["timedelta_3"] = (
        data_transformed["session_datetime_end"] - data_transformed["create_time"]
    ).dt.total_seconds()
    X = data_transformed.drop(
        columns=[
            "create_time",
            "model_create_time",
            "session_datetime_start",
            "session_datetime_end",
            "sessionkey_id",
            "categories",
            "models",
            "goods",
            "is_moderated",
        ]
    )
    if "is_callcenter" in data_transformed.columns:
        return X.drop(columns=["is_callcenter"]), X.is_callcenter.values
    return X
X, y = transform(tr, web_aggregate)
sns.boxplot(y=X["page_type_3"], hue=y, showfliers=False)
plt.show()
sns.boxplot(y=X["page_type_6"], hue=y, showfliers=False)
plt.show()
sns.boxplot(y=X["timedelta_3"], hue=y, showfliers=False)
plt.show()
Model training and analysis (1 point)¶
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.25, shuffle=False)
get_df_info(X)
| dtype | nunique | nan | zero | empty string | example(-s) | mode, mode proportion | trash_score | |
|---|---|---|---|---|---|---|---|---|
| pageview_duration_sec_last | float64 | 710 | n: 0.804 | z: 0.001 | NaN | (115.0, 183.0) | (6.0, 0.048) | 0.805 |
| page_type_6 | float64 | 37 | n: 0.006 | z: 0.673 | NaN | (31.0, 13.0) | (0.0, 0.677) | 0.679 |
| page_type_6_proportion | float64 | 832 | n: 0.006 | z: 0.673 | NaN | (0.225, 0.13846153846153847) | (0.0, 0.677) | 0.679 |
| rating_value | float64 | 11 | n: 0.677 | NaN | NaN | (6.0, 10.0) | (5.0, 0.672) | 0.677 |
| rating_count | float64 | 30 | n: 0.507 | z: 0.136 | NaN | (35.0, 13.0) | (1.0, 0.285) | 0.642 |
| description_length | int64 | 3106 | NaN | z: 0.384 | NaN | (802.0, 43.0) | (0.0, 0.384) | 0.384 |
| page_type_3_proportion | float64 | 1253 | n: 0.006 | z: 0.375 | NaN | (0.2459016393442623, 0.13690476190476192) | (0.0, 0.377) | 0.381 |
| page_type_3 | float64 | 54 | n: 0.006 | z: 0.375 | NaN | (17.0, 56.0) | (0.0, 0.377) | 0.381 |
| pageview_duration_sec_min | float64 | 1346 | n: 0.102 | z: 0.049 | NaN | (903.0, 136.0) | (4.0, 0.086) | 0.151 |
| datetime_diff | float64 | 83298 | n: 0.006 | z: 0.098 | NaN | (1993.894, 2653.873) | (0.0, 0.099) | 0.104 |
| pageview_duration_sec_max | float64 | 1875 | n: 0.102 | z: 0.001 | NaN | (13393.0, 621.0) | (30.0, 0.003) | 0.103 |
| timedelta_1 | float64 | 56348 | n: 0.006 | z: 0.097 | NaN | (245.52999999999997, 707.4899999999998) | (0.0, 0.098) | 0.103 |
| pageview_duration_sec_sum | float64 | 6252 | n: 0.006 | z: 0.097 | NaN | (821.0, 3399.0) | (0.0, 0.097) | 0.103 |
| utm_source | float64 | 289 | n: 0.1 | NaN | NaN | (6.0, 227.0) | (1.0, 0.476) | 0.100 |
| price_min | float64 | 5712 | n: 0.081 | NaN | NaN | (316.0, 8213.0) | (264.0, 0.005) | 0.081 |
| price_max | float64 | 7489 | n: 0.081 | NaN | NaN | (1719.0, 7930.0) | (952.0, 0.005) | 0.081 |
| last_page_type | float64 | 14 | n: 0.006 | NaN | NaN | (10.0, 11.0) | (1.0, 0.437) | 0.006 |
| session_length | float64 | 253 | n: 0.006 | NaN | NaN | (208.0, 12.0) | (1.0, 0.099) | 0.006 |
| timedelta_2 | float64 | 99461 | n: 0.006 | NaN | NaN | (1854.56, 2090.04) | (581.183, 0.0) | 0.006 |
| last_pageview_number | float64 | 253 | n: 0.006 | NaN | NaN | (155.0, 8.0) | (1.0, 0.098) | 0.006 |
| timedelta_3 | float64 | 96662 | n: 0.006 | NaN | NaN | (13.903, -2965.903) | (37.277, 0.0) | 0.006 |
| pics_qty | int64 | 34 | NaN | z: 0.005 | NaN | (19.0, 16.0) | (1.0, 0.366) | 0.005 |
| good_id | int64 | 53691 | NaN | NaN | NaN | (59724690.0, 32240662.0) | (66921494.0, 0.002) | NaN |
| price | int64 | 6362 | NaN | NaN | NaN | (2444.0, 152.0) | (264.0, 0.009) | NaN |
| utm_medium | int64 | 8 | NaN | NaN | NaN | (1.0, 6.0) | (1.0, 0.457) | NaN |
| category_id | int64 | 1733 | NaN | NaN | NaN | (7178.0, 3554.0) | (155.0, 0.09) | NaN |
| parent_id | int64 | 368 | NaN | NaN | NaN | (1542.0, 7948.0) | (154.0, 0.094) | NaN |
| root_id | int64 | 26 | NaN | NaN | NaN | (1481.0, 2303.0) | (1183.0, 0.264) | NaN |
| model_id | int64 | 37299 | NaN | NaN | NaN | (18677008.0, 4142405.0) | (18340251.0, 0.005) | NaN |
| goods_qty | int64 | 114 | NaN | NaN | NaN | (55.0, 8.0) | (1.0, 0.319) | NaN |
cat_features = [
    "utm_medium",
    # "good_id",
    # "category_id",
    # "parent_id",
    "root_id",
    # "model_id",
    "last_page_type",
]
train_dataset = lgb.Dataset(X_tr, y_tr, categorical_feature=cat_features)
val_dataset = lgb.Dataset(X_val, y_val, categorical_feature=cat_features)
model = lgb.train(
    {
        "boosting_type": "dart",
        "eta": 0.15,
        "objective": "binary",
        "metric": ["auc"],
        "neg_bagging_fraction": 0.2,
    },
    train_dataset,
    num_boost_round=100,
    valid_sets=[val_dataset],
    valid_names=["Validation"],
    callbacks=[
        lgb.log_evaluation(3),
    ],
)
t = model.trees_to_dataframe()
[LightGBM] [Warning] Met categorical feature which contains sparse values. Consider renumbering to consecutive integers started from zero
[LightGBM] [Info] Number of positive: 28110, number of negative: 50336
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.015987 seconds. You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 5261
[LightGBM] [Info] Number of data points in the train set: 78446, number of used features: 30
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.358336 -> initscore=-0.582595
[LightGBM] [Info] Start training from score -0.582595
[3]	Validation's auc: 0.957261
[6]	Validation's auc: 0.957913
[9]	Validation's auc: 0.958335
[12]	Validation's auc: 0.959779
[15]	Validation's auc: 0.960384
[18]	Validation's auc: 0.962162
[21]	Validation's auc: 0.963195
[24]	Validation's auc: 0.963843
[27]	Validation's auc: 0.964151
[30]	Validation's auc: 0.964368
[33]	Validation's auc: 0.964561
[36]	Validation's auc: 0.964596
[39]	Validation's auc: 0.964716
[42]	Validation's auc: 0.964841
[45]	Validation's auc: 0.9649
[48]	Validation's auc: 0.965022
[51]	Validation's auc: 0.965084
[54]	Validation's auc: 0.965132
[57]	Validation's auc: 0.965187
[60]	Validation's auc: 0.965206
[63]	Validation's auc: 0.965315
[66]	Validation's auc: 0.965247
[69]	Validation's auc: 0.965418
[72]	Validation's auc: 0.965437
[75]	Validation's auc: 0.965394
[78]	Validation's auc: 0.965377
[81]	Validation's auc: 0.965421
[84]	Validation's auc: 0.965449
[87]	Validation's auc: 0.965543
[90]	Validation's auc: 0.965529
[93]	Validation's auc: 0.965575
[96]	Validation's auc: 0.965572
[99]	Validation's auc: 0.965552
plot_scores(model, X_tr, y_tr, X_val, y_val)
Notice the peaks in the middle of the raw_scores plot; I'll come back to them later.
plot_tree_info(t)
plot_feature_info(t)

t.query("split_feature == 'page_type_3'")
| tree_index | node_depth | node_index | left_child | right_child | parent_index | split_feature | split_gain | threshold | decision_type | missing_direction | missing_type | value | weight | count | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 0-S0 | 0-S1 | 0-S2 | None | page_type_3 | 26844.900391 | 0.0 | <= | left | NaN | -0.321022 | 0.0000 | 78446 |
| 61 | 1 | 1 | 1-S0 | 1-S1 | 1-S2 | None | page_type_3 | 19225.099609 | 0.0 | <= | left | NaN | 0.000000 | 0.0000 | 78446 |
| 122 | 2 | 1 | 2-S0 | 2-S1 | 2-S2 | None | page_type_3 | 14347.599609 | 0.0 | <= | left | NaN | 0.000000 | 0.0000 | 78446 |
| 183 | 3 | 1 | 3-S0 | 3-S1 | 3-S2 | None | page_type_3 | 10951.400391 | 0.0 | <= | left | NaN | 0.000000 | 0.0000 | 78446 |
| 244 | 4 | 1 | 4-S0 | 4-S1 | 4-S2 | None | page_type_3 | 8475.990234 | 0.0 | <= | left | NaN | 0.000000 | 0.0000 | 78446 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 5795 | 95 | 1 | 95-S0 | 95-S1 | 95-S2 | None | page_type_3 | 104.882004 | 0.0 | <= | left | NaN | 0.000000 | 0.0000 | 78446 |
| 5843 | 95 | 9 | 95-S24 | 95-L11 | 95-S25 | 95-S23 | page_type_3 | 10.002700 | 1.5 | <= | left | NaN | -0.002012 | 327.6110 | 3022 |
| 5856 | 96 | 1 | 96-S0 | 96-S1 | 96-S2 | None | page_type_3 | 35.998798 | 0.0 | <= | left | NaN | 0.000000 | 0.0000 | 78446 |
| 5917 | 97 | 1 | 97-S0 | 97-S1 | 97-S2 | None | page_type_3 | 121.742996 | 0.0 | <= | left | NaN | 0.000000 | 0.0000 | 78446 |
| 6011 | 98 | 9 | 98-S15 | 98-L10 | 98-S16 | 98-S13 | page_type_3 | 10.774100 | 1.5 | <= | left | NaN | -0.022046 | 82.1623 | 938 |
99 rows × 15 columns
page_type_3 is mostly used only at the first split, with threshold 0. It might be worth making it binary so the model doesn't overfit on it.
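Since the trees split on this feature almost exclusively at threshold 0, a binarized version could be added as a hypothetical feature (`page_type_3_any` is a made-up column name):

```python
import pandas as pd

# In the notebook this would come from X["page_type_3"]; here a toy column.
X = pd.DataFrame({"page_type_3": [0.0, 2.0, 0.0, 5.0, None]})

# NaN compares as False, so missing sessions also land in the 0 bucket here;
# whether that is desirable would need a separate check.
X["page_type_3_any"] = (X["page_type_3"] > 0).astype(int)
print(X["page_type_3_any"].tolist())  # [0, 1, 0, 1, 0]
```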
t.query("split_feature == 'timedelta_3'")
| tree_index | node_depth | node_index | left_child | right_child | parent_index | split_feature | split_gain | threshold | decision_type | missing_direction | missing_type | value | weight | count | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0 | 2 | 0-S1 | 0-S10 | 0-S9 | 0-S0 | timedelta_3 | 10157.799805 | -2164.93 | <= | left | NaN | -0.192504 | 6872.6400 | 29890 |
| 3 | 0 | 4 | 0-S11 | 0-S16 | 0-S18 | 0-S10 | timedelta_3 | 172.850006 | -5370.1865 | <= | right | NaN | -0.286102 | 2387.1500 | 10382 |
| 20 | 0 | 5 | 0-S22 | 0-L16 | 0-S26 | 0-S15 | timedelta_3 | 51.521900 | -1955.455 | <= | left | NaN | -0.095339 | 3621.8800 | 15752 |
| 30 | 0 | 4 | 0-S14 | 0-L1 | 0-L15 | 0-S3 | timedelta_3 | 146.151001 | -539.3265 | <= | left | NaN | -0.433084 | 2299.7700 | 10002 |
| 34 | 0 | 5 | 0-S5 | 0-L4 | 0-S12 | 0-S4 | timedelta_3 | 3377.439941 | -1955.455 | <= | left | NaN | -0.215084 | 1462.5900 | 6361 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 6057 | 99 | 6 | 99-S6 | 99-S18 | 99-S16 | 99-S5 | timedelta_3 | 18.795799 | -435.3185 | <= | left | NaN | 0.046040 | 491.6040 | 19541 |
| 6075 | 99 | 2 | 99-S2 | 99-S10 | 99-S7 | 99-S0 | timedelta_3 | 18.351500 | 0.0 | <= | left | NaN | -0.036730 | 643.1310 | 25025 |
| 6077 | 99 | 4 | 99-S12 | 99-S20 | 99-L13 | 99-S10 | timedelta_3 | 11.489500 | -539.3265 | <= | left | NaN | -0.058996 | 91.6342 | 3624 |
| 6082 | 99 | 4 | 99-S11 | 99-S15 | 99-L12 | 99-S10 | timedelta_3 | 15.029200 | -198.09 | <= | left | NaN | 0.004037 | 249.5390 | 1625 |
| 6095 | 99 | 7 | 99-S28 | 99-L28 | 99-S29 | 99-S27 | timedelta_3 | 17.743099 | 36.192 | <= | left | NaN | 0.132610 | 12.9024 | 382 |
732 rows × 15 columns
Something strange: the model uses timedelta_3 in every single tree, and never at the first level. This is not an artifact of the small eta; it behaves this way regardless.
The threshold is also different every time. Perhaps a linear model on this feature is worth trying
px.scatter(X, x="timedelta_3", y=X["page_type_3"] > 0, color=y).update_layout(
yaxis_title="page_type_3 > 0",
)

But it doesn't look like a linear model would be optimal here :/
Still, you can see how timedelta_3 helped in the region where page_type_3 > 0. I think that with some data cleaning a higher score is achievable!
It may also be better not to put too many trees into the model, so it doesn't start chasing the scattered blue and yellow points. I suspect that is exactly what it is doing, and that is why it uses timedelta_3 in every tree
With that in mind, I will try to build a stable model and pick it as my second submission for the private leaderboard
Scored block (26 points)¶
1. Dimensionality reduction (5 points)¶
raise Exception  # guard so "Run All" does not re-execute the cells below
--------------------------------------------------------------------------- Exception Traceback (most recent call last) Cell In[34], line 1 ----> 1 raise Exception  # guard so "Run All" does not re-execute the cells below Exception:
An attempt at dimensionality reduction on features that describe the objects well¶
The training set contains many categorical columns, and they are not interpretable (at least not by me), because we were given no information about what their values mean :(
The tree split utm_medium into the groups (1, 3, 4, 5) and (6, 7); this can be exploited
model.trees_to_dataframe().query("split_feature == 'utm_medium'")[:5]
| tree_index | node_depth | node_index | left_child | right_child | parent_index | split_feature | split_gain | threshold | decision_type | missing_direction | missing_type | value | weight | count | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2 | 0 | 3 | 0-S10 | 0-S11 | 0-S17 | 0-S1 | utm_medium | 280.731995 | 1||3||4||5 | == | right | NaN | -0.300278 | 3196.040 | 13900 |
| 63 | 1 | 3 | 1-S9 | 1-S22 | 1-S11 | 1-S1 | utm_medium | 202.996994 | 6||7 | == | right | NaN | 0.014432 | 3225.800 | 13900 |
| 124 | 2 | 3 | 2-S10 | 2-S14 | 2-S20 | 2-S1 | utm_medium | 155.783005 | 1||3||4||5 | == | right | NaN | 0.018100 | 3244.450 | 13900 |
| 132 | 2 | 4 | 2-S20 | 2-S22 | 2-L21 | 2-S10 | utm_medium | 44.249699 | 2||7 | == | right | NaN | -0.021110 | 793.688 | 3518 |
| 185 | 3 | 3 | 3-S11 | 3-S14 | 3-S21 | 3-S1 | utm_medium | 117.456001 | 1||3||4||5 | == | right | NaN | 0.015245 | 3306.490 | 14101 |
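As the table shows, categorical splits appear in `trees_to_dataframe()` with `decision_type == '=='` and a `'1||3||4||5'`-style threshold string. A small helper to decode those into Python sets (my own utility, not part of LightGBM):

```python
def parse_categorical_threshold(threshold) -> set:
    """Decode a LightGBM categorical split threshold like '1||3||4||5' into a set of category ids."""
    return {int(v) for v in str(threshold).split("||")}

print(parse_categorical_threshold("1||3||4||5"))  # {1, 3, 4, 5}
```

Applied over the query above, this recovers the (1, 3, 4, 5) vs (6, 7) grouping directly instead of reading it off by eye.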
features = [
"page_type_3",
"timedelta_3",
"timedelta_1",
"pageview_duration_sec_last",
"page_type_6",
"timedelta_2",
"price",
"pageview_duration_sec_max",
"price_min",
"page_type_3_proportion",
"utm_medium_score",
"price_max",
"pageview_duration_sec_min",
"pageview_duration_sec_sum",
"page_type_6_proportion",
]
def transform_dr(X: pd.DataFrame, y: np.ndarray = None):
data = X.copy()
utm_medium_mapper = {
1: 100,
2: 1,
3: 100,
4: 100,
5: 100,
6: 10,
7: 10,
8: 1,
}
data["utm_medium_score"] = data["utm_medium"].map(utm_medium_mapper)
if y is not None:
data["is_callcenter"] = y
data.fillna(-100, inplace=True)
return data
get_df_info(transform_dr(X, y))
| dtype | nunique | nan | zero | empty string | example(-s) | mode, mode proportion | trash_score | |
|---|---|---|---|---|---|---|---|---|
| pageview_duration_sec_last | float64 | 710 | NaN | z: 0.001 | NaN | (201.0, 189.0) | (-100.0, 0.804) | 0.804 |
| page_type_6 | float64 | 37 | NaN | z: 0.673 | NaN | (18.0, 15.0) | (0.0, 0.673) | 0.673 |
| page_type_6_proportion | float64 | 832 | NaN | z: 0.673 | NaN | (0.14754098360655737, 0.13846153846153847) | (0.0, 0.673) | 0.673 |
| is_callcenter | int64 | 2 | NaN | z: 0.645 | NaN | (0.0, 1.0) | (0.0, 0.645) | 0.645 |
| description_length | int64 | 3106 | NaN | z: 0.384 | NaN | (802.0, 43.0) | (0.0, 0.384) | 0.384 |
| page_type_3 | float64 | 54 | NaN | z: 0.375 | NaN | (12.0, 35.0) | (0.0, 0.375) | 0.375 |
| page_type_3_proportion | float64 | 1253 | NaN | z: 0.375 | NaN | (0.22058823529411764, 0.13690476190476192) | (0.0, 0.375) | 0.375 |
| rating_count | float64 | 30 | NaN | z: 0.136 | NaN | (34.0, 20.0) | (-100.0, 0.507) | 0.136 |
| datetime_diff | float64 | 83298 | NaN | z: 0.098 | NaN | (1993.894, 1852.664) | (0.0, 0.098) | 0.098 |
| pageview_duration_sec_sum | float64 | 6252 | NaN | z: 0.097 | NaN | (3139.0, 3353.0) | (0.0, 0.097) | 0.097 |
| timedelta_1 | float64 | 56347 | NaN | z: 0.097 | NaN | (-0.15699999999998226, -0.4860000000001037) | (0.0, 0.097) | 0.097 |
| pageview_duration_sec_min | float64 | 1346 | NaN | z: 0.049 | NaN | (664.0, 665.0) | (-100.0, 0.102) | 0.049 |
| pics_qty | int64 | 34 | NaN | z: 0.005 | NaN | (19.0, 16.0) | (1.0, 0.366) | 0.005 |
| pageview_duration_sec_max | float64 | 1875 | NaN | z: 0.001 | NaN | (6573.0, 1189.0) | (-100.0, 0.102) | 0.001 |
| good_id | int64 | 53691 | NaN | NaN | NaN | (59724690.0, 32240662.0) | (66921494.0, 0.002) | NaN |
| price | int64 | 6362 | NaN | NaN | NaN | (2444.0, 152.0) | (264.0, 0.009) | NaN |
| utm_medium | int64 | 8 | NaN | NaN | NaN | (1.0, 6.0) | (1.0, 0.457) | NaN |
| utm_source | float64 | 289 | NaN | NaN | NaN | (35.0, 273.0) | (1.0, 0.428) | NaN |
| category_id | int64 | 1733 | NaN | NaN | NaN | (7178.0, 3554.0) | (155.0, 0.09) | NaN |
| parent_id | int64 | 368 | NaN | NaN | NaN | (1542.0, 7948.0) | (154.0, 0.094) | NaN |
| root_id | int64 | 26 | NaN | NaN | NaN | (1481.0, 2303.0) | (1183.0, 0.264) | NaN |
| model_id | int64 | 37299 | NaN | NaN | NaN | (18677008.0, 4142405.0) | (18340251.0, 0.005) | NaN |
| rating_value | float64 | 11 | NaN | NaN | NaN | (4.0, 5.0) | (-100.0, 0.677) | NaN |
| goods_qty | int64 | 114 | NaN | NaN | NaN | (55.0, 8.0) | (1.0, 0.319) | NaN |
| session_length | float64 | 253 | NaN | NaN | NaN | (383.0, 12.0) | (1.0, 0.098) | NaN |
| last_page_type | float64 | 14 | NaN | NaN | NaN | (-100.0, 12.0) | (1.0, 0.435) | NaN |
| last_pageview_number | float64 | 253 | NaN | NaN | NaN | (265.0, 8.0) | (1.0, 0.097) | NaN |
| price_min | float64 | 5712 | NaN | NaN | NaN | (1774.0, 2098.0) | (-100.0, 0.081) | NaN |
| price_max | float64 | 7489 | NaN | NaN | NaN | (16529.0, 2992.0) | (-100.0, 0.081) | NaN |
| timedelta_2 | float64 | 99461 | NaN | NaN | NaN | (364.843, 4837.257) | (-100.0, 0.006) | NaN |
| timedelta_3 | float64 | 96662 | NaN | NaN | NaN | (-159.94, -2965.903) | (-100.0, 0.006) | NaN |
| utm_medium_score | int64 | 3 | NaN | NaN | NaN | (100.0, 1.0) | (100.0, 0.734) | NaN |
data = transform_dr(X_tr, y_tr)
mapper_dict = {
"TSNE 2D": {
"params": {
"perplexity": 30,
"n_components": 2,
#
"n_jobs": 32,
"verbose": False,
},
"func": make_tsne,
},
"UMAP 2D without y": {
"params": {
"n_neighbors": 15,
"min_dist": 0.1,
"n_components": 2,
#
"n_jobs": 32,
"verbose": False,
},
"func": make_umap,
},
"UMAP 2D with y": {
"params": {
"n_neighbors": 15,
"min_dist": 0.1,
"n_components": 2,
#
"n_jobs": 32,
"verbose": False,
},
"func": partial(make_umap, y=data["is_callcenter"]),
},
}
drplotter = DimReductionPlotter()
results = drplotter.plot_dim_reduction(
data,
mapper_dict,
default_features=features,
default_hue_info=("is_callcenter", True),
)

The pictures come out almost identical for both is_callcenter values :(. Only UMAP with labels works reasonably well
There are also long tails, most likely produced by outliers in the data (several sessions merged into one, and negative values). They could be fixed by hand
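A sketch of one such manual fix: winsorizing the heavy-tailed columns to quantiles (the helper and the quantile bounds are my choice, not tuned against this data):

```python
import pandas as pd

def clip_tails(df: pd.DataFrame, cols, lower_q: float = 0.01, upper_q: float = 0.99) -> pd.DataFrame:
    """Clip each column to its [lower_q, upper_q] quantile range to tame outliers."""
    out = df.copy()
    for col in cols:
        lo, hi = out[col].quantile([lower_q, upper_q])
        out[col] = out[col].clip(lo, hi)
    return out

# Toy check with wide bounds so the effect is visible on 5 rows:
toy = pd.DataFrame({"timedelta_1": [-5000.0, 0.1, 0.2, 0.3, 90000.0]})
print(clip_tails(toy, ["timedelta_1"], 0.25, 0.75)["timedelta_1"].tolist())
# [0.1, 0.1, 0.2, 0.3, 0.3]
```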
clusterer = hdbscan.HDBSCAN(
min_cluster_size=200,
gen_min_span_tree=True,
prediction_data=True,
)
X_tr["cluster_id"] = clusterer.fit_predict(results["UMAP 2D with y"]["embedding"])
X_val["cluster_id"] = hdbscan.approximate_predict(
clusterer,
results["UMAP 2D with y"]["mapper"].transform(transform_dr(X_val)[features].values),
)[0]
train_dataset = lgb.Dataset(
X_tr, y_tr, categorical_feature=cat_features + ["cluster_id"]
)
val_dataset = lgb.Dataset(
X_val, y_val, categorical_feature=cat_features + ["cluster_id"]
)
model = lgb.train(
{
"boosting_type": "dart",
"eta": 0.2,
"objective": "binary",
"metric": ["auc", ""],
"neg_bagging_fraction": 0.2,
},
train_dataset,
100,
[val_dataset],
["Validation"],
callbacks=[
lgb.log_evaluation(3),
],
)
[LightGBM] [Warning] Met categorical feature which contains sparse values. Consider renumbering to consecutive integers started from zero [LightGBM] [Warning] Met negative value in categorical features, will convert it to NaN [LightGBM] [Info] Number of positive: 28110, number of negative: 50336 [LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.001420 seconds. You can set `force_col_wise=true` to remove the overhead. [LightGBM] [Info] Total Bins 5283 [LightGBM] [Info] Number of data points in the train set: 78446, number of used features: 31 [LightGBM] [Info] [binary:BoostFromScore]: pavg=0.358336 -> initscore=-0.582595 [LightGBM] [Info] Start training from score -0.582595 [3] Validation's auc: 0.808559 [6] Validation's auc: 0.809859 [9] Validation's auc: 0.810023 [12] Validation's auc: 0.798178 [15] Validation's auc: 0.797292 [18] Validation's auc: 0.889415 [21] Validation's auc: 0.892297 [24] Validation's auc: 0.890831 [27] Validation's auc: 0.887568 [30] Validation's auc: 0.889944 [33] Validation's auc: 0.888973 [36] Validation's auc: 0.886435 [39] Validation's auc: 0.889596 [42] Validation's auc: 0.889336 [45] Validation's auc: 0.889704 [48] Validation's auc: 0.891321 [51] Validation's auc: 0.891555 [54] Validation's auc: 0.889247 [57] Validation's auc: 0.888259 [60] Validation's auc: 0.888992 [63] Validation's auc: 0.887077 [66] Validation's auc: 0.888238 [69] Validation's auc: 0.886502 [72] Validation's auc: 0.886932 [75] Validation's auc: 0.890603 [78] Validation's auc: 0.890796 [81] Validation's auc: 0.890969 [84] Validation's auc: 0.894233 [87] Validation's auc: 0.893101 [90] Validation's auc: 0.892958 [93] Validation's auc: 0.891136 [96] Validation's auc: 0.8913 [99] Validation's auc: 0.890424
from sklearn.metrics import roc_auc_score
roc_auc_score(y_tr, model.predict(X_tr))
0.9986133563103403
We overfit badly: the supervised UMAP embedded the training labels, so cluster_id leaked the target back into the model :(
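For reference, the leak-free way to build target-dependent features (like a cluster id from supervised UMAP) is to compute them out-of-fold: each row gets its feature from a mapper fitted without that row's label. A generic sketch, with a toy mean-encoder standing in for the UMAP + HDBSCAN pipeline (all names here are my own):

```python
import numpy as np
from sklearn.model_selection import KFold

def out_of_fold_feature(X, y, fit, transform, n_splits=5):
    """For each fold: fit on the other folds (using their y),
    then transform only the held-out rows (without their labels)."""
    feat = np.empty(len(X), dtype=float)
    for tr_idx, val_idx in KFold(n_splits, shuffle=True, random_state=0).split(X):
        state = fit(X[tr_idx], y[tr_idx])
        feat[val_idx] = transform(state, X[val_idx])
    return feat

# Toy stand-in: "fit" memorizes the fold's mean target, "transform" broadcasts it.
X_toy = np.arange(20).reshape(-1, 1)
y_toy = (np.arange(20) % 2).astype(float)
feat = out_of_fold_feature(
    X_toy, y_toy,
    fit=lambda X, y: y.mean(),
    transform=lambda state, X: np.full(len(X), state),
)
```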
X_tr.drop(columns=["cluster_id"], inplace=True, errors="ignore")
X_val.drop(columns=["cluster_id"], inplace=True, errors="ignore")
What if we embed something other than is_callcenter¶
data = transform_dr(X_tr, y_tr)
mapper_dict = {
"TSNE 2D": {
"params": {
"perplexity": 30,
"n_components": 2,
#
"n_jobs": 32,
"verbose": False,
},
"func": make_tsne,
},
"UMAP 2D without y": {
"params": {
"n_neighbors": 15,
"min_dist": 0.1,
"n_components": 2,
#
"n_jobs": 32,
"verbose": False,
},
"func": make_umap,
},
"UMAP 2D with y": {
"params": {
"n_neighbors": 15,
"min_dist": 0.1,
"n_components": 2,
#
"n_jobs": 32,
"verbose": False,
},
"func": partial(make_umap, y=data["utm_medium"]),
},
}
drplotter = DimReductionPlotter()
results = drplotter.plot_dim_reduction(
data,
mapper_dict,
default_features=features,
default_hue_info=("utm_medium", True),
)

The same tails, but utm_medium is scattered almost uniformly
data = transform_dr(X_tr, y_tr)
mapper_dict = {
"TSNE 2D": {
"params": {
"perplexity": 30,
"n_components": 2,
#
"n_jobs": 32,
"verbose": False,
},
"func": make_tsne,
},
"UMAP 2D without y": {
"params": {
"n_neighbors": 15,
"min_dist": 0.1,
"n_components": 2,
#
"n_jobs": 32,
"verbose": False,
},
"func": make_umap,
},
}
drplotter = DimReductionPlotter()
results = drplotter.plot_dim_reduction(
data,
mapper_dict,
default_features=features,
default_hue_info=("root_id", False),
)

Same story here :/
Let's look at the scores¶
train_dataset = lgb.Dataset(X_tr, y_tr, categorical_feature=cat_features)
val_dataset = lgb.Dataset(X_val, y_val, categorical_feature=cat_features)
model = lgb.train(
{
"boosting_type": "dart",
"eta": 0.15,
"objective": "binary",
"metric": ["auc", ""],
"neg_bagging_fraction": 0.2,
},
train_dataset,
100,
[val_dataset],
["Validation"],
callbacks=[
lgb.log_evaluation(3),
],
)
t = model.trees_to_dataframe()
[LightGBM] [Warning] Met categorical feature which contains sparse values. Consider renumbering to consecutive integers started from zero [LightGBM] [Info] Number of positive: 28110, number of negative: 50336 [LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.073822 seconds. You can set `force_row_wise=true` to remove the overhead. And if memory is not enough, you can set `force_col_wise=true`. [LightGBM] [Info] Total Bins 5261 [LightGBM] [Info] Number of data points in the train set: 78446, number of used features: 30 [LightGBM] [Info] [binary:BoostFromScore]: pavg=0.358336 -> initscore=-0.582595 [LightGBM] [Info] Start training from score -0.582595 [3] Validation's auc: 0.957261 [6] Validation's auc: 0.957913 [9] Validation's auc: 0.958335 [12] Validation's auc: 0.959779 [15] Validation's auc: 0.960384 [18] Validation's auc: 0.962162 [21] Validation's auc: 0.963195 [24] Validation's auc: 0.963843 [27] Validation's auc: 0.964151 [30] Validation's auc: 0.964368 [33] Validation's auc: 0.964561 [36] Validation's auc: 0.964596 [39] Validation's auc: 0.964716 [42] Validation's auc: 0.964841 [45] Validation's auc: 0.9649 [48] Validation's auc: 0.965022 [51] Validation's auc: 0.965084 [54] Validation's auc: 0.965132 [57] Validation's auc: 0.965187 [60] Validation's auc: 0.965206 [63] Validation's auc: 0.965315 [66] Validation's auc: 0.965247 [69] Validation's auc: 0.965418 [72] Validation's auc: 0.965437 [75] Validation's auc: 0.965394 [78] Validation's auc: 0.965377 [81] Validation's auc: 0.965421 [84] Validation's auc: 0.965449 [87] Validation's auc: 0.965543 [90] Validation's auc: 0.965529 [93] Validation's auc: 0.965575 [96] Validation's auc: 0.965572 [99] Validation's auc: 0.965552
leaves = model.predict(X_val, pred_leaf=True)
scores = np.array(
[
[model.get_leaf_output(i, leaves[j, i]) for i in range(leaves.shape[1])]
for j in range(leaves.shape[0])
]
)
scores = pd.DataFrame(scores, columns=map(str, range(100)))
scores["is_callcenter"] = y_val
mapper_dict = {
"TSNE 2D": {
"params": {
"perplexity": 20,
"n_components": 2,
#
"n_jobs": 32,
"verbose": False,
},
"func": make_tsne,
},
"UMAP 2D": {
"params": {
"n_neighbors": 11,
"min_dist": 0.1,
"n_components": 2,
#
"n_jobs": 32,
"verbose": False,
},
"func": make_umap,
},
}
scores_drplotter = DimReductionPlotter()
_ = scores_drplotter.plot_dim_reduction(
scores, mapper_dict, list(map(str, range(100))), ("is_callcenter", True)
)

Both methods show an overlap between the two classes, while everything else separates reasonably well. The overlapping points correspond to the peaks in the raw_score histogram where the model is unsure of the label
A little cloud over the sessions :-)¶
web_aggregate_features = [
"session_length",
"page_type_3",
"page_type_6",
"last_pageview_number",
# "pageview_duration_sec_last",
"pageview_duration_sec_sum",
# "pageview_duration_sec_min",
# "pageview_duration_sec_max",
# "price_min",
# "price_max",
"datetime_diff",
"timedelta_1",
"page_type_3_proportion",
"page_type_6_proportion",
]
mapper_dict = {
"TSNE 2D perplexity=15": {
"params": {
"perplexity": 15,
"n_components": 2,
#
"n_jobs": 32,
"verbose": False,
},
"func": make_tsne,
},
"TSNE 2D perplexity=30": {
"params": {
"perplexity": 30,
"n_components": 2,
#
"n_jobs": 32,
"verbose": False,
},
"func": make_tsne,
},
}
drplotter = DimReductionPlotter()
results = drplotter.plot_dim_reduction(
web_aggregate[
(web_aggregate.index.isin(set(tr["sessionkey_id"]) | set(tst["sessionkey_id"])))
],
mapper_dict,
web_aggregate_features,
)

Something interesting! A single cloud of points
clusterer = hdbscan.HDBSCAN(
min_cluster_size=100,
gen_min_span_tree=True,
prediction_data=True,
)
clusters = clusterer.fit_predict(results["TSNE 2D perplexity=30"]["embedding"])
sns.scatterplot(
x=results["TSNE 2D perplexity=30"]["embedding"][:, 0],
y=results["TSNE 2D perplexity=30"]["embedding"][:, 1],
hue=clusters,
)
plt.show()
web_aggregate.loc[
(web_aggregate.index.isin(set(tr["sessionkey_id"]) | set(tst["sessionkey_id"]))),
"cluster_id",
] = clusters
X, y = transform(tr, web_aggregate)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.25, shuffle=False)
train_dataset = lgb.Dataset(
X_tr, y_tr, categorical_feature=cat_features + ["cluster_id"]
)
val_dataset = lgb.Dataset(
X_val, y_val, categorical_feature=cat_features + ["cluster_id"]
)
model = lgb.train(
{
"boosting_type": "dart",
"eta": 0.15,
"objective": "binary",
"metric": ["auc", ""],
"neg_bagging_fraction": 0.2,
},
train_dataset,
100,
[val_dataset],
["Validation"],
callbacks=[
lgb.log_evaluation(3),
],
)
t = model.trees_to_dataframe()
[LightGBM] [Warning] Met negative value in categorical features, will convert it to NaN [LightGBM] [Warning] Met categorical feature which contains sparse values. Consider renumbering to consecutive integers started from zero [LightGBM] [Info] Number of positive: 28110, number of negative: 50336 [LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.003494 seconds. You can set `force_col_wise=true` to remove the overhead. [LightGBM] [Info] Total Bins 5268 [LightGBM] [Info] Number of data points in the train set: 78446, number of used features: 31 [LightGBM] [Info] [binary:BoostFromScore]: pavg=0.358336 -> initscore=-0.582595 [LightGBM] [Info] Start training from score -0.582595 [3] Validation's auc: 0.957261 [6] Validation's auc: 0.957913 [9] Validation's auc: 0.958335 [12] Validation's auc: 0.959779 [15] Validation's auc: 0.960384 [18] Validation's auc: 0.962162 [21] Validation's auc: 0.963195 [24] Validation's auc: 0.963843 [27] Validation's auc: 0.964151 [30] Validation's auc: 0.964368 [33] Validation's auc: 0.964561 [36] Validation's auc: 0.964596 [39] Validation's auc: 0.964716 [42] Validation's auc: 0.964841 [45] Validation's auc: 0.9649 [48] Validation's auc: 0.965022 [51] Validation's auc: 0.965084 [54] Validation's auc: 0.965132 [57] Validation's auc: 0.965187 [60] Validation's auc: 0.965206 [63] Validation's auc: 0.965315 [66] Validation's auc: 0.965247 [69] Validation's auc: 0.965418 [72] Validation's auc: 0.965437 [75] Validation's auc: 0.965394 [78] Validation's auc: 0.965377 [81] Validation's auc: 0.965421 [84] Validation's auc: 0.965449 [87] Validation's auc: 0.965543 [90] Validation's auc: 0.965529 [93] Validation's auc: 0.965575 [96] Validation's auc: 0.965572 [99] Validation's auc: 0.965552
dict(zip(X_tr.columns, model.feature_importance("gain")))["cluster_id"]
0.0
Not a particularly useful feature :* (On the first run it at least split on it once)
web_aggregate.drop(columns=["cluster_id"], inplace=True, errors="ignore")
X, y = transform(tr, web_aggregate)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.25, shuffle=False)
Mini summary¶
Overall, I don't think there is much point in reducing dimensionality or clustering on this data. Most of the information lives in a couple of variables on which it is hard to define a sensible metric, because they are not numeric
But I tried
2. Clustering (3 points)¶
My model did not use the information about products viewed within sessions, and I want to fix that
Since these are categorical features, doing this metrically is hard (I tried the Jaccard and Hamming metrics, but the distance matrix contained only two distinct values: 0 on the diagonal and 1 everywhere else :/), so I decided to try a tree instead
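For the record, the degenerate 0/1 matrix is what you get when Jaccard/Hamming is computed row-wise on raw ID columns, where almost every pair disagrees on everything. A set-based Jaccard over per-session aggregates (like the `categories` sets built in the aggregation step below) would be better behaved; a minimal sketch:

```python
def jaccard_distance(a: set, b: set) -> float:
    """1 - |A & B| / |A | B|; two empty sets count as identical (distance 0)."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)

# Two toy sessions sharing one of three distinct categories:
print(jaccard_distance({722, 7196}, {722, 127}))  # 1 - 1/3 = 0.666...
```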
Clustering via leaves¶
web_cat = web.loc[
web["page_type"].isin([1, 2, 4]),
["sessionkey_id", "category_id", "model_id", "good_id", "price"],
].set_index("sessionkey_id")
web_cat
| category_id | model_id | good_id | price | |
|---|---|---|---|---|
| sessionkey_id | ||||
| 109996122 | 722.0 | NaN | NaN | NaN |
| 109996122 | 7196.0 | NaN | NaN | NaN |
| 109996122 | 779.0 | NaN | NaN | NaN |
| 109996122 | 7196.0 | NaN | NaN | NaN |
| 109996122 | 723.0 | NaN | NaN | NaN |
| ... | ... | ... | ... | ... |
| 134628743 | 127.0 | NaN | NaN | NaN |
| 134628743 | 127.0 | 9401923.0 | NaN | NaN |
| 134628743 | 127.0 | NaN | NaN | NaN |
| 134628743 | 127.0 | 17200183.0 | NaN | NaN |
| 134629277 | NaN | NaN | NaN | NaN |
2483376 rows × 4 columns
cluster_tr = tr.join(web_cat, "sessionkey_id", "inner", lsuffix="_tr")[
[
"category_id",
"model_id",
"good_id",
"price",
"is_callcenter",
]
]
X_cluster, y_cluster = (
cluster_tr.drop(columns="is_callcenter"),
cluster_tr.is_callcenter.values,
)
tree = lgb.train({"num_leaves": 10}, lgb.Dataset(X_cluster, y_cluster), 1)
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.002363 seconds. You can set `force_row_wise=true` to remove the overhead. And if memory is not enough, you can set `force_col_wise=true`. [LightGBM] [Info] Total Bins 1020 [LightGBM] [Info] Number of data points in the train set: 1046532, number of used features: 4 [LightGBM] [Info] Start training from score 0.285141
tree.trees_to_dataframe()
| tree_index | node_depth | node_index | left_child | right_child | parent_index | split_feature | split_gain | threshold | decision_type | missing_direction | missing_type | value | weight | count | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 0-S0 | 0-S1 | 0-S5 | None | price | 1464.050049 | 793.5 | <= | left | NaN | 0.285141 | 0 | 1046532 |
| 1 | 0 | 2 | 0-S1 | 0-S2 | 0-L2 | 0-S0 | category_id | 635.273010 | 3974.5 | <= | right | NaN | 0.282752 | 743326 | 743326 |
| 2 | 0 | 3 | 0-S2 | 0-S4 | 0-S3 | 0-S1 | category_id | 278.243988 | 132.5 | <= | left | NaN | 0.284679 | 518231 | 518231 |
| 3 | 0 | 4 | 0-S4 | 0-L0 | 0-L5 | 0-S2 | category_id | 139.705994 | 125.5 | <= | left | NaN | 0.293974 | 30318 | 30318 |
| 4 | 0 | 5 | 0-L0 | None | None | 0-S4 | None | NaN | NaN | None | None | None | 0.280335 | 6019 | 6019 |
| 5 | 0 | 5 | 0-L5 | None | None | 0-S4 | None | NaN | NaN | None | None | None | 0.297353 | 24299 | 24299 |
| 6 | 0 | 4 | 0-S3 | 0-L3 | 0-S6 | 0-S2 | price | 175.552002 | 541.5 | <= | right | NaN | 0.284101 | 487913 | 487913 |
| 7 | 0 | 5 | 0-L3 | None | None | 0-S3 | None | NaN | NaN | None | None | None | 0.280539 | 107786 | 107786 |
| 8 | 0 | 5 | 0-S6 | 0-S7 | 0-L7 | 0-S3 | category_id | 113.382004 | 1326.5 | <= | left | NaN | 0.285111 | 380127 | 380127 |
| 9 | 0 | 6 | 0-S7 | 0-L4 | 0-L8 | 0-S6 | category_id | 145.516006 | 1265.5 | <= | left | NaN | 0.283248 | 175707 | 175707 |
| 10 | 0 | 7 | 0-L4 | None | None | 0-S7 | None | NaN | NaN | None | None | None | 0.284015 | 164063 | 164063 |
| 11 | 0 | 7 | 0-L8 | None | None | 0-S7 | None | NaN | NaN | None | None | None | 0.272446 | 11644 | 11644 |
| 12 | 0 | 6 | 0-L7 | None | None | 0-S6 | None | NaN | NaN | None | None | None | 0.286712 | 204420 | 204420 |
| 13 | 0 | 3 | 0-L2 | None | None | 0-S1 | None | NaN | NaN | None | None | None | 0.278316 | 225095 | 225095 |
| 14 | 0 | 2 | 0-S5 | 0-L1 | 0-S8 | 0-S0 | model_id | 122.767998 | 5943291.5 | <= | left | NaN | 0.290997 | 303206 | 303206 |
| 15 | 0 | 3 | 0-L1 | None | None | 0-S5 | None | NaN | NaN | None | None | None | 0.293285 | 132269 | 132269 |
| 16 | 0 | 3 | 0-S8 | 0-L6 | 0-L9 | 0-S5 | category_id | 87.764198 | 7063.5 | <= | left | NaN | 0.289227 | 170937 | 170937 |
| 17 | 0 | 4 | 0-L6 | None | None | 0-S8 | None | NaN | NaN | None | None | None | 0.289540 | 167738 | 167738 |
| 18 | 0 | 4 | 0-L9 | None | None | 0-S8 | None | NaN | NaN | None | None | None | 0.272819 | 3199 | 3199 |
Attach the cluster label to each session event based on the leaf it landed in
web.loc[web["page_type"].isin([1, 2, 4]), "cluster_id"] = tree.predict(
web_cat, pred_leaf=True
)
web.head()
| sessionkey_id | date_time | page_type | pageview_number | pageview_duration_sec | category_id | model_id | good_id | price | product_in_sale | cluster_id | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 2268917 | 109996122 | 1975-10-17 13:42:56.953 | 2 | 1 | 11.0 | 722.0 | NaN | NaN | NaN | NaN | 4.0 |
| 2268918 | 109996122 | 1975-10-17 13:43:07.510 | 2 | 2 | 22.0 | 7196.0 | NaN | NaN | NaN | NaN | 2.0 |
| 2268919 | 109996122 | 1975-10-17 13:43:29.860 | 2 | 3 | 25.0 | 779.0 | NaN | NaN | NaN | NaN | 4.0 |
| 2269206 | 109996122 | 1975-10-17 13:43:54.757 | 2 | 4 | 9.0 | 7196.0 | NaN | NaN | NaN | NaN | 2.0 |
| 2267445 | 109996122 | 1975-10-17 13:44:03.803 | 2 | 5 | 11.0 | 723.0 | NaN | NaN | NaN | NaN | 4.0 |
Now let's aggregate the cluster information
agg_params = {
"session_length": ("sessionkey_id", lambda x: x.shape[0]),
#
"session_datetime_start": ("date_time", lambda x: x.iloc[0]),
"session_datetime_end": ("date_time", lambda x: x.iloc[-1]),
#
"last_page_type": ("page_type", lambda x: x.iloc[-1]),
**{
f"page_type_{i}": ("page_type", partial(lambda x, i: x[x == i].count(), i=i))
for i in (3, 6)
},
#
# **{
# f"page_type_{i}": ("page_type", partial(lambda x, i: x[x == i].count(), i=i))
# for i in range(1, 13 + 1)
    # },  # 3 and 6 turned out to be the most useful; computing only them to save time
#
#
"last_pageview_number": ("pageview_number", lambda x: x.max()),
#
"pageview_duration_sec_last": ("pageview_duration_sec", lambda x: x.iloc[-1]),
"pageview_duration_sec_sum": ("pageview_duration_sec", lambda x: np.nansum(x)),
"pageview_duration_sec_min": ("pageview_duration_sec", lambda x: x.min()),
"pageview_duration_sec_max": ("pageview_duration_sec", lambda x: x.max()),
#
"categories": ("category_id", lambda x: set(x[~x.isna()].astype(int))),
#
"models": ("model_id", lambda x: set(x[~x.isna()].astype(int))),
#
"goods": ("good_id", lambda x: set(x[~x.isna()].astype(int))),
#
"price_min": ("price", lambda x: x.min()),
"price_max": ("price", lambda x: x.max()),
#
**{
f"cluster_id_{i}": ("cluster_id", partial(lambda x, i: x[x == i].count(), i=i))
for i in range(10)
},
}
web_aggregate = web.groupby("sessionkey_id", sort=False).agg(**agg_params)
web_aggregate["datetime_diff"] = (
web_aggregate["session_datetime_end"] - web_aggregate["session_datetime_start"]
).dt.total_seconds()
web_aggregate["timedelta_1"] = (
web_aggregate["datetime_diff"] - web_aggregate["pageview_duration_sec_sum"]
)
for i in (3, 6):
web_aggregate[f"page_type_{i}_proportion"] = (
web_aggregate[f"page_type_{i}"] / web_aggregate["session_length"]
)
# for i in range(1, 13 + 1):
# web_aggregate[f"page_type_{i}_proportion"] = (
# web_aggregate[f"page_type_{i}"] / web_aggregate["session_length"]
# )
web_aggregate.sample(5)
| session_length | session_datetime_start | session_datetime_end | last_page_type | page_type_3 | page_type_6 | last_pageview_number | pageview_duration_sec_last | pageview_duration_sec_sum | pageview_duration_sec_min | ... | cluster_id_4 | cluster_id_5 | cluster_id_6 | cluster_id_7 | cluster_id_8 | cluster_id_9 | datetime_diff | timedelta_1 | page_type_3_proportion | page_type_6_proportion | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| sessionkey_id | |||||||||||||||||||||
| 120615479 | 17 | 1975-12-10 12:22:21.907 | 1975-12-10 12:45:09.480 | 1 | 0 | 0 | 17 | NaN | 1368.0 | 9.0 | ... | 7 | 0 | 3 | 2 | 0 | 0 | 1367.573 | -0.427 | 0.000000 | 0.000000 |
| 126391440 | 1 | 1976-01-09 10:45:30.097 | 1976-01-09 10:45:30.097 | 6 | 0 | 1 | 4 | NaN | 0.0 | NaN | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0.000 | 0.000 | 0.000000 | 1.000000 |
| 117443192 | 11 | 1975-11-25 14:15:39.247 | 1975-11-25 14:22:51.317 | 2 | 0 | 0 | 11 | NaN | 432.0 | 5.0 | ... | 0 | 0 | 0 | 6 | 0 | 0 | 432.070 | 0.070 | 0.000000 | 0.000000 |
| 115670979 | 30 | 1975-11-16 10:27:49.977 | 1975-11-16 10:39:49.263 | 2 | 1 | 1 | 35 | NaN | 534.0 | 0.0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 719.286 | 185.286 | 0.033333 | 0.033333 |
| 129596506 | 3 | 1976-01-23 07:54:26.957 | 1976-01-23 08:11:13.260 | 9 | 0 | 0 | 3 | NaN | 1007.0 | 16.0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1006.303 | -0.697 | 0.000000 | 0.000000 |
5 rows × 30 columns
X, y = transform(tr, web_aggregate)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.25, shuffle=False)
train_dataset = lgb.Dataset(X_tr, y_tr, categorical_feature=cat_features)
val_dataset = lgb.Dataset(X_val, y_val, categorical_feature=cat_features)
model_clusters = lgb.train(
{
"boosting_type": "dart",
"eta": 0.15,
"objective": "binary",
"metric": ["auc", ""],
"neg_bagging_fraction": 0.2,
},
train_dataset,
300,
[val_dataset],
["Validation"],
callbacks=[
lgb.log_evaluation(3),
],
)
t_clusters = model_clusters.trees_to_dataframe()
[LightGBM] [Warning] Met categorical feature which contains sparse values. Consider renumbering to consecutive integers started from zero [LightGBM] [Info] Number of positive: 28110, number of negative: 50336
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.093527 seconds. You can set `force_row_wise=true` to remove the overhead. And if memory is not enough, you can set `force_col_wise=true`. [LightGBM] [Info] Total Bins 5760 [LightGBM] [Info] Number of data points in the train set: 78446, number of used features: 40 [LightGBM] [Info] [binary:BoostFromScore]: pavg=0.358336 -> initscore=-0.582595 [LightGBM] [Info] Start training from score -0.582595 [3] Validation's auc: 0.957261 [6] Validation's auc: 0.957913 [9] Validation's auc: 0.958335 [12] Validation's auc: 0.959779 [15] Validation's auc: 0.960384 [18] Validation's auc: 0.962166 [21] Validation's auc: 0.963215 [24] Validation's auc: 0.96378 [27] Validation's auc: 0.963956 [30] Validation's auc: 0.964226 [33] Validation's auc: 0.964413 [36] Validation's auc: 0.964544 [39] Validation's auc: 0.964682 [42] Validation's auc: 0.964783 [45] Validation's auc: 0.964912 [48] Validation's auc: 0.964983 [51] Validation's auc: 0.964963 [54] Validation's auc: 0.965076 [57] Validation's auc: 0.965076 [60] Validation's auc: 0.965144 [63] Validation's auc: 0.965155 [66] Validation's auc: 0.965159 [69] Validation's auc: 0.965136 [72] Validation's auc: 0.965118 [75] Validation's auc: 0.965151 [78] Validation's auc: 0.965196 [81] Validation's auc: 0.964963 [84] Validation's auc: 0.965073 [87] Validation's auc: 0.965156 [90] Validation's auc: 0.965156 [93] Validation's auc: 0.965201 [96] Validation's auc: 0.9652 [99] Validation's auc: 0.965145 [102] Validation's auc: 0.965152 [105] Validation's auc: 0.965177 [108] Validation's auc: 0.965248 [111] Validation's auc: 0.965222 [114] Validation's auc: 0.96521 [117] Validation's auc: 0.965241 [120] Validation's auc: 0.96526 [123] Validation's auc: 0.965241 [126] Validation's auc: 0.965242 [129] Validation's auc: 0.965023 [132] Validation's auc: 0.965168 [135] Validation's auc: 0.965172 [138] Validation's auc: 0.965179 [141] Validation's auc: 0.965124 
[144] Validation's auc: 0.965155 [147] Validation's auc: 0.965172 [150] Validation's auc: 0.965205 [153] Validation's auc: 0.965244 [156] Validation's auc: 0.965233 [159] Validation's auc: 0.965269 [162] Validation's auc: 0.96539 [165] Validation's auc: 0.965263 [168] Validation's auc: 0.965247 [171] Validation's auc: 0.965079 [174] Validation's auc: 0.96517 [177] Validation's auc: 0.965165 [180] Validation's auc: 0.965123 [183] Validation's auc: 0.965137 [186] Validation's auc: 0.965121 [189] Validation's auc: 0.965145 [192] Validation's auc: 0.965119 [195] Validation's auc: 0.965188 [198] Validation's auc: 0.965137 [201] Validation's auc: 0.965055 [204] Validation's auc: 0.965114 [207] Validation's auc: 0.965174 [210] Validation's auc: 0.965186 [213] Validation's auc: 0.9652 [216] Validation's auc: 0.965047 [219] Validation's auc: 0.965016 [222] Validation's auc: 0.96492 [225] Validation's auc: 0.964921 [228] Validation's auc: 0.964938 [231] Validation's auc: 0.964894 [234] Validation's auc: 0.964995 [237] Validation's auc: 0.965042 [240] Validation's auc: 0.96501 [243] Validation's auc: 0.965039 [246] Validation's auc: 0.965047 [249] Validation's auc: 0.96511 [252] Validation's auc: 0.965115 [255] Validation's auc: 0.965069 [258] Validation's auc: 0.965039 [261] Validation's auc: 0.965056 [264] Validation's auc: 0.965096 [267] Validation's auc: 0.965034 [270] Validation's auc: 0.965041 [273] Validation's auc: 0.965008 [276] Validation's auc: 0.965048 [279] Validation's auc: 0.965023 [282] Validation's auc: 0.964977 [285] Validation's auc: 0.965005 [288] Validation's auc: 0.964979 [291] Validation's auc: 0.965066 [294] Validation's auc: 0.965078 [297] Validation's auc: 0.965108 [300] Validation's auc: 0.965139
plot_feature_info(t_clusters)

This idea also turned out not to be particularly useful :/
web_aggregate.drop(
columns=[f"cluster_id_{i}" for i in range(10)], errors="ignore", inplace=True
)
3. Nearest neighbors (3 points)¶
Since dimensionality reduction yielded little of value (because of the feature types), it does not seem particularly sensible to try nearest neighbors. Clustering could still be combined with trees somehow, but ordinary trees already do exactly this: they group similar objects together. So I won't reinvent the wheel :*
4. lightgbm: model.trees_to_dataframe (5 points)¶
Everything that belongs in this block is in EDA -> Model training and analysis, because I use it in the first items.
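For reference, `trees_to_dataframe` returns one row per tree node (leaf rows have no `split_feature`), so split usage per feature can be counted with a plain groupby. A sketch on a mock frame (the values are made up; the column names match LightGBM's real output):

```python
import pandas as pd

# Mock of the frame Booster.trees_to_dataframe() returns:
# one row per node; leaf rows have split_feature == None.
t = pd.DataFrame({
    "tree_index":    [0, 0, 0, 1, 1, 1],
    "split_feature": ["timedelta_3", None, None, "page_type_3", "timedelta_3", None],
    "split_gain":    [12.5, None, None, 8.0, 3.1, None],
})

# Count how often each feature is used for a split across all trees.
split_counts = (
    t.dropna(subset=["split_feature"])
     .groupby("split_feature")
     .size()
     .sort_values(ascending=False)
)
```

The same groupby also works with `split_gain` sums if you want gain-based rather than count-based importance.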
5. catboost: model.get_object_importance (4 + 1 points)¶
train_dataset = lgb.Dataset(X_tr, y_tr, categorical_feature=cat_features)
val_dataset = lgb.Dataset(X_val, y_val, categorical_feature=cat_features)
model = lgb.train(
    {
        "boosting_type": "dart",
        "eta": 0.15,
        "objective": "binary",
        "metric": ["auc", ""],  # "" drops the default binary_logloss, so only AUC is logged
        "neg_bagging_fraction": 0.2,
    },
    train_dataset,
    num_boost_round=100,
    valid_sets=[val_dataset],
    valid_names=["Validation"],
    callbacks=[
        lgb.log_evaluation(3),
    ],
)
t = model.trees_to_dataframe()
[LightGBM] [Warning] Met categorical feature which contains sparse values. Consider renumbering to consecutive integers started from zero
[LightGBM] [Info] Number of positive: 28110, number of negative: 50336
[LightGBM] [Info] Total Bins 5760
[LightGBM] [Info] Number of data points in the train set: 78446, number of used features: 40
[LightGBM] [Info] Start training from score -0.582595
[3] Validation's auc: 0.957261
...
[93] Validation's auc: 0.965201
...
[99] Validation's auc: 0.965145
(per-iteration log condensed: validation AUC plateaus near 0.965 after ~60 rounds)
plot_scores(model, X_tr, y_tr, X_val, y_val)
tr_pool = cb.Pool(X_tr, y_tr)
val_pool = cb.Pool(X_val, y_val)
catboost = cb.train(
tr_pool,
{"iterations": 100, "eval_metric": "AUC", "loss_function": "Logloss"},
eval_set=val_pool,
)
Learning rate set to 0.253436
0: test: 0.9505862 best: 0.9505862 (0)
...
85: test: 0.9636746 best: 0.9636746 (85)
...
99: test: 0.9623737 best: 0.9636746 (85)
bestTest = 0.9636745675
bestIteration = 85
Shrink model to first 86 iterations.
(per-iteration log condensed: test AUC peaks at 0.96367 at iteration 85, then degrades slightly)
y_val_raw = catboost.predict(X_val, prediction_type="RawFormulaVal")
plt.title("Validation raw")
sns.histplot(x=y_val_raw, hue=y_val, bins=33)
plt.show()
As with lightgbm, catboost has a region of raw scores roughly from -2 to 2 where the model is unsure of its answer and makes mistakes. Let's look at which objects influence this.
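Why ±2? Binary boosting raw scores pass through a sigmoid, so |raw| < 2 corresponds to predicted probabilities between roughly 0.12 and 0.88, i.e. far from a confident 0 or 1. A minimal numpy check:

```python
import numpy as np

def sigmoid(m):
    """Convert a raw boosting margin to a probability."""
    return 1.0 / (1.0 + np.exp(-m))

# The |raw| < 2 band maps to probabilities in roughly (0.12, 0.88),
# which is exactly the "unsure" zone of the classifier.
lo, hi = sigmoid(-2.0), sigmoid(2.0)
```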
undefined_pool = cb.Pool(X_val[np.abs(y_val_raw) < 2], y_val[np.abs(y_val_raw) < 2])
undefined_pool.shape
(7155, 40)
indices, scores = catboost.get_object_importance(
    undefined_pool.slice(np.arange(10)),
    tr_pool,
    top_size=5,  # does not affect speed
    type="PerObject",  # does not affect speed
    update_method="SinglePoint",  # strongly affects speed! SinglePoint is the way to go :)
    importance_values_sign="All",  # does not affect speed
    thread_count=32,
)
for i in range(10):
    print(f"index={i}, real_target={y_val[np.abs(y_val_raw) < 2][i]}")
    display(X_tr.iloc[indices[i]][["page_type_3", "timedelta_3", "timedelta_1"]])
    display(y_tr[indices[i]])
index=0, real_target=1
| page_type_3 | timedelta_3 | timedelta_1 | |
|---|---|---|---|
| order_id | |||
| 1175248 | 0.0 | -595598.777 | -0.024 |
| 1164495 | 0.0 | -519842.500 | -0.330 |
| 1253852 | 2.0 | -434375.977 | -0.484 |
| 1193397 | 0.0 | -526297.223 | 0.030 |
| 1171789 | 0.0 | -568726.220 | 0.487 |
array([1, 1, 1, 1, 1])
index=1, real_target=1
| page_type_3 | timedelta_3 | timedelta_1 | |
|---|---|---|---|
| order_id | |||
| 1259279 | 0.0 | -320871.673 | -0.230 |
| 1202542 | 1.0 | -2119.147 | -24.567 |
| 1281626 | 0.0 | -7150.057 | 0.720 |
| 1234980 | 0.0 | -9649.763 | -0.236 |
| 1297762 | 0.0 | -14707.633 | -0.333 |
array([0, 1, 1, 0, 0])
index=2, real_target=0
| page_type_3 | timedelta_3 | timedelta_1 | |
|---|---|---|---|
| order_id | |||
| 1302382 | 5.0 | 281.060 | 295.717 |
| 1314834 | 3.0 | 586.417 | 1289.560 |
| 1328959 | 10.0 | 125.847 | 713.440 |
| 1212635 | 2.0 | 1668.270 | 1258.523 |
| 1231565 | 5.0 | 4366.357 | 880.744 |
array([1, 1, 1, 1, 1])
index=3, real_target=0
| page_type_3 | timedelta_3 | timedelta_1 | |
|---|---|---|---|
| order_id | |||
| 1195237 | 1.0 | -2382.353 | -42.306 |
| 1301008 | 2.0 | -1421.270 | -0.130 |
| 1211465 | 1.0 | -320.433 | -17.820 |
| 1334601 | 0.0 | -1170.037 | 0.750 |
| 1236722 | 1.0 | -1861.740 | 0.117 |
array([1, 0, 1, 0, 0])
index=4, real_target=0
| page_type_3 | timedelta_3 | timedelta_1 | |
|---|---|---|---|
| order_id | |||
| 1213396 | 2.0 | -72394.113 | 0.524 |
| 1338806 | 2.0 | -413448.743 | -15.860 |
| 1253422 | 0.0 | -311268.510 | 0.000 |
| 1261851 | 0.0 | -605585.217 | 0.130 |
| 1180464 | 0.0 | -410573.487 | -0.294 |
array([1, 1, 1, 1, 1])
index=5, real_target=0
| page_type_3 | timedelta_3 | timedelta_1 | |
|---|---|---|---|
| order_id | |||
| 1262444 | 0.0 | -519054.693 | 0.0 |
| 1288013 | 0.0 | -358709.213 | 0.0 |
| 1283500 | 0.0 | -436569.980 | 0.0 |
| 1181366 | 0.0 | -389951.583 | 0.0 |
| 1245312 | 0.0 | -518868.227 | 0.0 |
array([1, 1, 1, 1, 1])
index=6, real_target=0
| page_type_3 | timedelta_3 | timedelta_1 | |
|---|---|---|---|
| order_id | |||
| 1262921 | 1.0 | -178.763 | -23.976 |
| 1324028 | 0.0 | -4629.663 | -0.200 |
| 1264136 | 2.0 | -1497.023 | 25.687 |
| 1278684 | 1.0 | -10831.283 | 0.694 |
| 1302861 | 0.0 | -3170.397 | -0.294 |
array([1, 1, 0, 1, 1])
index=7, real_target=1
| page_type_3 | timedelta_3 | timedelta_1 | |
|---|---|---|---|
| order_id | |||
| 1218925 | 0.0 | -6455.033 | 0.0 |
| 1324091 | 0.0 | -12338.623 | 0.0 |
| 1332701 | 0.0 | -74900.727 | 0.0 |
| 1292198 | 0.0 | -12986.190 | 0.0 |
| 1331822 | 0.0 | -28757.787 | 0.0 |
array([0, 0, 0, 0, 0])
index=8, real_target=1
| page_type_3 | timedelta_3 | timedelta_1 | |
|---|---|---|---|
| order_id | |||
| 1210623 | 4.0 | -357.677 | 376.873 |
| 1292705 | 4.0 | -1739.663 | 168.614 |
| 1276537 | 2.0 | -300.260 | 411.833 |
| 1160067 | 2.0 | -310.787 | 169.263 |
| 1228244 | 1.0 | -2013.380 | 1515.240 |
array([0, 0, 1, 0, 0])
index=9, real_target=0
| page_type_3 | timedelta_3 | timedelta_1 | |
|---|---|---|---|
| order_id | |||
| 1262444 | 0.0 | -519054.693 | 0.000 |
| 1212600 | 0.0 | -256489.260 | 0.440 |
| 1164070 | 0.0 | -432832.037 | 0.186 |
| 1177533 | 0.0 | -408052.063 | 0.607 |
| 1313622 | 0.0 | -502089.843 | 0.000 |
array([1, 1, 1, 1, 1])
Note that for indices (4, 5, 9) the real label is 0, yet the most important objects for them carry label 1 and, at the same time, a huge timedelta_3 that looks like an outlier. So perhaps fixing this could yield a higher score.
timedelta_3_clip_lower = -100000  # ideally this constant would be tuned
X_tr["timedelta_3"] = X_tr["timedelta_3"].clip(lower=timedelta_3_clip_lower)
X_val["timedelta_3"] = X_val["timedelta_3"].clip(lower=timedelta_3_clip_lower)
fig, ax = plt.subplots(1, 2, figsize=(18, 6))
sns.histplot(x=X_tr["timedelta_3"], hue=y_tr, bins=33, ax=ax[0])
sns.histplot(x=X_val["timedelta_3"], hue=y_val, bins=33, ax=ax[1])
plt.show()
train_dataset = lgb.Dataset(X_tr, y_tr, categorical_feature=cat_features)
val_dataset = lgb.Dataset(X_val, y_val, categorical_feature=cat_features)
lgbm_clf = lgb.train(
    {
        "boosting_type": "dart",
        "eta": 0.15,
        "objective": "binary",
        "metric": ["auc", ""],  # "" drops the default binary_logloss, so only AUC is logged
        "neg_bagging_fraction": 0.2,
    },
    train_dataset,
    num_boost_round=100,
    valid_sets=[val_dataset],
    valid_names=["Validation"],
    callbacks=[
        lgb.log_evaluation(3),
    ],
)
t = lgbm_clf.trees_to_dataframe()
[LightGBM] [Warning] Met categorical feature which contains sparse values. Consider renumbering to consecutive integers started from zero
[LightGBM] [Info] Number of positive: 28110, number of negative: 50336
[LightGBM] [Info] Number of data points in the train set: 78446, number of used features: 40
[LightGBM] [Info] Start training from score -0.582595
[3] Validation's auc: 0.955965
...
[93] Validation's auc: 0.965463
[96] Validation's auc: 0.965445
[99] Validation's auc: 0.965438
(per-iteration log condensed: validation AUC now peaks at 0.965463, slightly above the unclipped run)
Slightly better, but no big gain :/
6. SHAP (5 points)¶
lgbm_explainer = shap.TreeExplainer(model)
lgbm_shap_values = lgbm_explainer(X_val)
shap.plots.beeswarm(lgbm_shap_values, max_display=10)
lgbm uses:
- page_type_3 and page_type_6 essentially as "> 0" -> 0 and "== 0" -> 1
- pageview_duration_sec_last as "not nan" -> 0 and "nan" -> 1
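These splits can be materialized directly as boolean flags; a sketch on hypothetical data (the column names follow the notebook, the values are made up):

```python
import pandas as pd

# Hypothetical frame mimicking the aggregated features.
df = pd.DataFrame({
    "page_type_3": [0, 2, 0, 5],
    "page_type_6": [1, 0, 0, 3],
    "pageview_duration_sec_last": [12.0, None, 3.5, None],
})

# The splits the beeswarm suggests: "== 0" vs "> 0", and "nan" vs "not nan".
flags = pd.DataFrame({
    "page_type_3_is_zero": (df["page_type_3"] == 0).astype(int),
    "page_type_6_is_zero": (df["page_type_6"] == 0).astype(int),
    "duration_last_is_nan": df["pageview_duration_sec_last"].isna().astype(int),
})
```

This is the same trick as the boolean `page_type_3` used for the catboost submission below, just written out for all three features.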
cb_explainer = shap.TreeExplainer(catboost)
cb_shap_values = cb_explainer(X_val)
shap.plots.beeswarm(cb_shap_values, max_display=10)
catboost, for some reason, does not use page_type_3 at all. That might be why its score is lower, but no matter what I tried, it never put this feature in first place, so I'm going to stick with lgbm (how exactly I tried this wasn't saved, so you'll have to take my word for it).
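One way to back this up is to rank features by mean absolute SHAP value: a feature the model never splits on gets exactly zero attribution. A sketch on a hypothetical SHAP matrix (in the notebook this would be `cb_shap_values.values` and `X_val.columns`):

```python
import numpy as np
import pandas as pd

# Hypothetical SHAP values, shape (n_samples, n_features); made-up numbers.
shap_vals = np.array([
    [ 0.9, -0.1, 0.0],
    [-1.2,  0.2, 0.0],
    [ 0.4, -0.3, 0.0],
])
features = ["timedelta_3", "page_type_6", "page_type_3"]

# Rank features by mean |SHAP|; an unused feature (page_type_3 here)
# ends up with zero attribution.
ranking = (
    pd.Series(np.abs(shap_vals).mean(axis=0), index=features)
      .sort_values(ascending=False)
)
```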
So here, too, I couldn't extract any benefit for the task.
Submission¶
I proudly decided to skip hyperparameter tuning!
X["timedelta_3"] = X["timedelta_3"].clip(lower=timedelta_3_clip_lower)
train_dataset = lgb.Dataset(X, y, categorical_feature=cat_features)
model = lgb.train(
    {
        "boosting_type": "dart",
        "eta": 0.15,
        "objective": "binary",
        "metric": ["auc", ""],  # "" drops the default binary_logloss, so only AUC is logged
        "neg_bagging_fraction": 0.2,
    },
    train_dataset,
    num_boost_round=100,
)
X_tst = transform(tst, web_aggregate)
X_tst["timedelta_3"] = X_tst["timedelta_3"].clip(lower=timedelta_3_clip_lower)
submission = pd.read_csv(SAMPLE_SUBMISSION_PATH, index_col="order_id")
submission["is_callcenter"] = model.predict(X_tst)
submission.to_csv(SUBMISSION_PATH)
submission
[LightGBM] [Warning] Met categorical feature which contains sparse values. Consider renumbering to consecutive integers started from zero
[LightGBM] [Info] Number of positive: 37099, number of negative: 67496
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.001702 seconds. You can set `force_row_wise=true` to remove the overhead. And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 5430
[LightGBM] [Info] Number of data points in the train set: 104595, number of used features: 30
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.354692 -> initscore=-0.598478
[LightGBM] [Info] Start training from score -0.598478
| is_callcenter | |
|---|---|
| order_id | |
| 1350922 | 0.008653 |
| 1354989 | 0.012926 |
| 1352637 | 0.503872 |
| 1350050 | 0.751182 |
| 1341733 | 0.272487 |
| ... | ... |
| 1358397 | 0.231185 |
| 1357968 | 0.016199 |
| 1358835 | 0.990611 |
| 1365692 | 0.114166 |
| 1365429 | 0.007388 |
17196 rows × 1 columns
For catboost I decided to build a 20-tree model and apply the boolean page_type_3 idea, to avoid overfitting.
X["timedelta_3"] = X["timedelta_3"].clip(lower=timedelta_3_clip_lower)
X["page_type_3"] = (X["page_type_3"] > 0).astype(int)
tr_pool = cb.Pool(X, y)
catboost = cb.train(
tr_pool, {"iterations": 20, "eval_metric": "AUC", "loss_function": "Logloss"}
)
X_tst = transform(tst, web_aggregate)
X_tst["timedelta_3"] = X_tst["timedelta_3"].clip(lower=timedelta_3_clip_lower)
X_tst["page_type_3"] = (X_tst["page_type_3"] > 0).astype(int)
submission = pd.read_csv(SAMPLE_SUBMISSION_PATH, index_col="order_id")
submission["is_callcenter"] = catboost.predict(X_tst, prediction_type="Probability")[
:, 1
]
submission.to_csv(SUBMISSION_PATH)
submission
Learning rate set to 0.5
19: total: 87.5ms remaining: 0us
(per-iteration timing log condensed: 20 iterations trained in ~88 ms)
| is_callcenter | |
|---|---|
| order_id | |
| 1350922 | 0.007815 |
| 1354989 | 0.011382 |
| 1352637 | 0.474446 |
| 1350050 | 0.663885 |
| 1341733 | 0.223427 |
| ... | ... |
| 1358397 | 0.244059 |
| 1357968 | 0.010922 |
| 1358835 | 0.971819 |
| 1365692 | 0.169072 |
| 1365429 | 0.005213 |
17196 rows × 1 columns